Abstract

Motivated by applications in large-scale knowledge base construction, we study the problem of scaling up 
a sophisticated statistical inference framework called Markov Logic Networks (MLNs). Our approach, Felix, 
uses the idea of Lagrangian relaxation from mathematical programming to decompose a program into smaller 
tasks while preserving the joint-inference property of the original MLN. The advantage is that we can use 
highly scalable specialized algorithms for common tasks such as classification and coreference. We propose an 
architecture to support Lagrangian relaxation in an RDBMS which we show enables scalable joint inference 
for MLNs. We empirically validate that Felix is significantly more scalable and efficient than prior approaches 
to MLN inference by constructing a knowledge base from 1.8M documents as part of the TAC challenge. We 
show that Felix scales and achieves state-of-the-art quality numbers. In contrast, prior approaches do not 
scale even to a subset of the corpus that is three orders of magnitude smaller. 

1 Introduction 

Building large-scale knowledge bases from text has recently received tremendous interest from academia [48],
e.g., CMU's NELL [8] and MPI's YAGO [21, 29], and from industry, e.g., Microsoft's EntityCube [52] and IBM's
Watson [17]. In their quest to extract knowledge from free-form text, a major problem that all these systems
face is coping with inconsistency due to both conflicting information in the underlying sources and the difficulty
for machines to understand natural language text. To cope with this challenge, each of the above systems uses
statistical inference to resolve these ambiguities in a principled way. To support this, the research community has
developed sophisticated statistical inference frameworks, e.g., PRMs [18], BLOG [28], MLNs [34], SOFIE [43],
Factorie [26], and LBJ [36]. The key challenge with these systems is efficiency and scalability, and to develop
the next generation of sophisticated text applications, we argue that a promising approach is to improve the
efficiency and scalability of the above frameworks.

To understand the challenges of scaling such frameworks, we focus on one popular such framework, called
Markov Logic Networks (MLNs), that has been successfully applied to many challenging text applications [4, 32,
43, 52]. In Markov Logic one can write first-order logic rules with weights (that intuitively model our confidence
in a rule); this allows a developer to capture rules that are likely, but not certain, to be correct. A key technical
challenge has been the scalability of MLN inference. Not surprisingly, there has been intense research interest
in techniques to improve the scalability and performance of MLNs, such as improving memory efficiency [42],
leveraging database technologies [30], and designing algorithms for special-purpose programs [4, 43]. Our work
here continues this line of work.

Our goal is to use Markov Logic to construct a structured database of facts and then answer questions 
like "which Bulgarian leaders attended Sofia University and when?" with provenance from text. (Our system, 
Felix, answers Georgi Parvanov and points to a handful of sentences in a corpus to demonstrate its answer.) 
During the iterative process of constructing such a knowledge base from text and then using that knowledge 
base to answer sophisticated questions, we have found that it is critical to efficiently process structured queries 
over large volumes of structured data. And so, we have built Felix on top of an RDBMS. However, as we verify
experimentally later in this paper, the scalability of previous RDBMS-based solutions to MLN inference [30]
is still limited. Our key observation is that in many text processing applications, one must solve a handful
of common subproblems, e.g., coreference resolution or classification. Some of these have been studied for
decades, and so have specialized algorithms with higher scalability on these subproblems than the monolithic
inference used by typical Markov Logic systems. Thus, our goal is to leverage the specialized algorithms
for these subproblems to provide more scalable inference for general Markov Logic programs in an RDBMS.
Figure 1 illustrates the difference at a high level between Felix and prior approaches: prior approaches, such
as Alchemy [34] or Tuffy [30], are monolithic in that they attack the entire MLN inference problem with one
algorithm; in contrast, Felix decomposes the problem into several small tasks.

Figure 1: Felix breaks an input program, $\Gamma$, into several smaller tasks (shown in Panel a), while prior
approaches are monolithic (shown in Panel b).

To achieve this goal, we observe that the problem of inference in an MLN (and essentially any kind of
statistical inference) can be cast as a mathematical optimization problem. Thus, we adapt techniques from
the mathematical programming literature to MLN inference. In particular, we consider the idea of Lagrangian
relaxation [6, p. 244] that allows one to decompose a complex optimization problem into multiple pieces that
are hopefully easier to solve [37, 51]. Lagrangian relaxation is a widely deployed technique to cope with many
difficult mathematical programming problems, and it is the theoretical underpinning of many state-of-the-art
inference algorithms for graphical models, e.g., Belief Propagation [46]. In many (but not all) cases, a
Lagrangian relaxation has the same optimal solution as the underlying original problem [6, 7, 51]. At a high
level, Lagrangian relaxation gives us a message-passing protocol that resolves inconsistencies among conflicting 
predictions to accomplish joint-inference. Our system, Felix, does not actually construct the mathematical 
program, but uses Lagrangian relaxation as a formal guide to decompose an MLN program into multiple tasks 
and construct an appropriate message-passing scheme. 

Our first technical contribution is an architecture to scalably perform MLN inference in an RDBMS using 
Lagrangian relaxation. Our architecture models each subproblem as a task that takes as input a set of relations, 
and outputs another set of relations. For example, our prototype of Felix implements specialized algorithms 
for classification and coreference resolution (coref); these tasks frequently occur in text-processing applications. 
By modeling tasks in this way, we are able to use SQL queries for all data movement in the system: both 
transforming the input data into an appropriate form for each task and encoding the message passing of 
Lagrangian relaxation between tasks. In turn, this allows Felix to leverage the mature, set-at-a-time processing
power of an RDBMS to achieve scalability and efficiency. On all programs and datasets that we experimented
with, our approach converges rapidly to the optimal solution of the Lagrangian relaxation. Our ultimate 
goal is to build high-quality applications, and we validate on several knowledge-base construction tasks that 
Felix achieves higher scalability and essentially identical result quality compared to prior MLN systems. More 
precisely, when prior MLN systems are able to scale, Felix converges to the same quality (and sometimes 
more efficiently). When prior MLN systems fail to scale, Felix can still produce high-quality results. We 
take this as evidence that Felix's approach is a promising direction to scale up large-scale statistical inference. 
Furthermore, we validate that being able to integrate specialized algorithms is crucial for Felix's scalability: 
after disabling specialized algorithms, Felix no longer scales to the same datasets. 

Although the RDBMS provides some level of scalability for data movement inside Felix, the scale of data
passed between tasks (via SQL queries) may be staggering. The reason is that statistical algorithms may
produce huge numbers of combinations (say all pairs of potentially matching person mentions). The sheer sizes
of intermediate results are often killers for scalability, e.g., the complete input to coreference resolution on an
Enron dataset has 1.2 × 10^11 tuples. The saving grace is that a task may access the intermediate data in an
on-demand manner. For example, a popular coref algorithm repeatedly asks "given a fixed word x, tell me all
words that are likely to be coreferent with x" [3, 5]. Moreover, the algorithm only asks for a small fraction of
such x. Thus, it would be wasteful to produce all possible matching pairs. Instead we can produce only those
words that are needed on-demand (i.e., materialize them lazily). Felix considers a richer space of possible
materialization strategies than simply eager or lazy: it can choose to eagerly materialize one or more subqueries
responsible for data movement between tasks [33]. To make such decisions, Felix's second contribution is a
novel cost model that leverages the cost-estimation facility in the RDBMS coupled with the data-access patterns
of the tasks. On the Enron dataset, our cost-based approach finds execution plans that achieve two orders of
magnitude speedup over eager materialization and 2-3X speedup compared to lazy materialization.

Figure 2: An example MLN program that performs three tasks jointly: 1. discover affiliation relationships
between people and organizations (affil); 2. resolve coreference among people mentions (pCoref); and 3.
resolve coreference among organization mentions (oCoref). The remaining eight relations are evidence relations.
In particular, coOccurs stores person-organization co-occurrences; *Sim* relations are string similarities.

Although Felix allows a user to provide any decomposition scheme, identifying decompositions could be 
difficult for some users, so we do not want to force users to specify a decomposition to use Felix. To support 
this, we need a compiler that performs task decomposition given a standard MLN program as input. Building 
on classical and new results in embedded dependency inference from the database theory literature [1, 2, 10, 14],
we show that the underlying problem of compilation is $\Pi_2^P$-complete in easier cases, and undecidable in more
difficult cases. To cope, we develop a sound (but not complete) compiler that takes as input an ordinary MLN 
program, identifies common tasks such as classification and coref, and then assigns those tasks to specialized 
algorithms. 

To validate that our system can perform sophisticated knowledge-base construction tasks, we use the Felix
system to implement a solution for the TAC-KBP (Knowledge Base Population) challenge.^1 Given a 1.8M
document corpus, the goal is to perform two related tasks: (1) entity linking: extract all entity mentions and
map them to entries in Wikipedia, and (2) slot filling: determine relationships between entities. The reason for
choosing this task is that it contains ground truth so that we can assess the results: we achieved F1=0.80 on
entity linking (human performance is 0.90), and F1=0.34 on slot filling (state-of-the-art quality).^2 In addition
to KBP, we also use three information extraction (IE) datasets that have state-of-the-art solutions. On all
four datasets, we show that Felix is significantly more scalable than monolithic systems such as Tuffy and
Alchemy; this in turn enables Felix to efficiently process sophisticated MLNs and produce high-quality
results. Furthermore, we validate that our individual technical contributions are crucial to the overall performance
and quality of Felix.

1 http://nlp.cs.qc.cuny.edu/kbp/2010/

2 F1 is the harmonic mean of precision and recall.
Outline In Section 2 we describe related work. In Section 3 we describe a simple text application encoded
as an MLN program, and the Lagrangian relaxation technique in mathematical programming. In Section 4
we present an overview of Felix's architecture and some key concepts. In Section 5 we describe key technical
challenges and how Felix addresses them: how to execute individual tasks with high performance and quality,
how to improve the data movement efficiency between tasks, and how to automatically recognize specialized
tasks in an MLN program. In Section 6 we use extensive experiments to validate the overall advantage of
Felix as well as individual technical contributions.



2 Related Work 

There is a trend to build semantically deep text applications with increasingly sophisticated statistical
inference [15, 43, 49, 52]. We follow on this line of work. However, while the goal of prior work is to explore
the effectiveness of different correlation structures on particular applications, our goal is to support general
application development by scaling up existing statistical inference frameworks. Wang et al. [47] explore
multiple inference algorithms for information extraction. However, their system focuses on managing low-level
extractions in CRF models, whereas our goal is to use MLNs to support knowledge base construction.

Felix specializes to MLNs. There are, however, other statistical inference frameworks such as PRMs [18],
BLOG [28], Factorie [26, 50], and PrDB [40]. Our hope is that the techniques developed here apply to these
frameworks as well. 

Researchers have proposed different approaches to improving MLN inference performance in the context of 
text applications. In StatSnowball [52] , Zhu et al. demonstrate high quality results of an MLN-based approach. 
To address the scalability issue of generic MLN inference, they make additional independence assumptions in 
their programs. In contrast, the goal of Felix is to automatically scale up statistical inference while sticking 
to MLN semantics. Theobald et al. [44] design specialized MaxSAT algorithms that efficiently solve MLN 
programs of special forms. In contrast, we study how to scale general MLN programs. Riedel [35] proposed a 
cutting-plane meta-algorithm that iteratively performs grounding and inference, but the underlying grounding 
and inference procedures are still for generic MLNs. In Tuffy [30], the authors improve the scalability of MLN
inference with an RDBMS, but their system is still a monolithic approach that consists of generic inference 
procedures. 

As a classic technique, Lagrangian relaxation has been applied to closely related statistical models (i.e.,
graphical models) [20, 46]. However, there the input is directly a mathematical optimization problem and the
granularity of decomposition is individual variables. In contrast, our input is a program in a high-level language, 
and we perform decomposition at the relation level inside an RDBMS. 



Our materialization tradeoff strategy is related to view materialization and selection [11, 41] in the context
of data warehousing. However, our problem setting is different: we focus on batch processing so that we do
not consider maintenance cost. The idea of lazy-eager tradeoff in view materialization or query answering has
also been applied to probabilistic databases [50]. However, their goal is efficiently maintaining intermediate
results, rather than choosing a materialization strategy. Similar in spirit to our approach is Sprout [31], which
considers lazy-versus-eager plans for when to apply confidence computation, but they do not consider inference
decomposition.



3 Preliminaries 

To illustrate how MLNs can be used in text-processing applications, we first walk through a program that
extracts affiliations between people and organizations from Web text. We then describe how Lagrangian relaxation
is used for mathematical optimization. 

3.1 Markov Logic Networks in Felix

In text applications, a typical first step is to use standard NLP toolkits to generate raw data such as plausible 
mentions of people and organizations in a Web corpus and their co-occurrences. But transforming such raw 
signals into high-quality and semantically coherent knowledge bases is a challenging task. For example, a major 
challenge is that a single real-world entity may be referred to in many different ways, e.g., "UCB" and
"UC-Berkeley". To address such challenges, MLNs provide a framework where we can express logical assertions that
are only likely to be true (and quantify such likelihood). Below we explain the key concepts in this framework 
by walking through an example. 

Our system Felix is a middleware system: it takes as input a standard MLN program, performs statistical 
inference, and outputs its results into one or more relations that are stored in a relational database (PostgreSQL). 
An MLN program consists of three parts: schema, evidence, and rules. To tell Felix what data will be provided 
or generated, the user provides a schema. Some relations are standard database relations, and we call these 
relations evidence. Intuitively, evidence relations contain tuples that we assume are correct. In the schema of 
Figure 2, the first eight relations are evidence relations. For example, we know that 'Ullman' and 'Stanford
Univ.' co-occur in some webpage, and that 'Doc201' is the homepage of 'Joe'. In addition to evidence relations,
there are also relations whose content we do not know, but we want the MLN program to predict; they are
called query relations. In Figure 2, affil is a query relation since we want the MLN to predict affiliation
relationships between persons and organizations. The other two query relations are pCoref and oCoref , for 
person and organization coreference, respectively. 

In addition to schema and evidence, we also provide a set of MLN rules that encode our knowledge about 
the correlations and constraints over the relations. An MLN rule is a first-order logic formula associated with 
an extended-real-valued number called a weight. Infinite-weighted rules are called hard rules, which means that 
they must hold in any prediction that the MLN system makes. In contrast, rules with finite weights are soft 
rules: a positive weight indicates confidence in the rule's correctness.^3 (In Felix, weights can be set by the
user or automatically learned. We do not discuss learning in this work.) 

Example 1 An important type of hard rule is a standard SQL query, e.g., to transform the results for use in the
application. A more sophisticated example of a hard rule is to encode that coreference has a transitive property,
which is captured by the hard rule $F_3$. Rules $F_8$ and $F_9$ use person-organization co-occurrences (coOccurs)
together with coreference (pCoref and oCoref) to deduce affiliation relationships (affil). These rules are soft
since co-occurrence in a webpage does not necessarily imply affiliation.

Intuitively, when a soft rule is violated, we pay a cost equal to the absolute value of its weight (described
below). For example, if coOccurs('Ullman', 'Stanford Univ.') and pCoref('Ullman', 'Jeff Ullman'), but not
affil('Jeff Ullman', 'Stanford Univ.'), then we pay a cost of 4 because of $F_9$. The goal of an MLN inference
algorithm is to find a prediction that minimizes the sum of such costs.

Semantics An MLN program defines a probability distribution over database instances (possible worlds).
Formally, we first fix a schema $\sigma$ (as in Figure 2) and a domain $D$. Given as input a set of formulae
$\bar{F} = F_1, \ldots, F_N$ with weights $w_1, \ldots, w_N$, they define a probability distribution over possible worlds
(deterministic databases) as follows. Given a formula $F_k$ with free variables $\bar{x} = (x_1, \ldots, x_m)$, then for each
$\bar{d} \in D^m$, we create a new formula $g_{\bar{d}}$ called a ground formula, where $g_{\bar{d}}$ denotes the result of substituting
each variable $x_i$ of $F_k$ with $d_i$. We assign the weight $w_k$ to $g_{\bar{d}}$. Denote by $G = (\bar{g}, w)$ the set of all such
weighted ground formulae of $\bar{F}$. We call the set of all tuples in $G$ the ground database. Let $w$ be a function that
maps each ground formula to its assigned weight. Fix an MLN $\bar{F}$; then for any possible world (instance) $I$ we say
a ground formula $g$ is violated if $w(g) > 0$ and $g$ is false in $I$, or if $w(g) < 0$ and $g$ is true in $I$. We denote
the set of ground formulae violated in a world $I$ as $V(I)$. The cost of the world $I$ is

$$\mathrm{cost}_{\mathrm{mln}}(I) = \sum_{g \in V(I)} |w(g)| \qquad (1)$$

Through $\mathrm{cost}_{\mathrm{mln}}$, an MLN defines a probability distribution over all instances using the exponential family of
distributions (that are the basis for graphical models [46]):

$$\Pr[I] = Z^{-1} \exp\{-\mathrm{cost}_{\mathrm{mln}}(I)\}$$

where $Z$ is a normalizing constant.

3 Roughly, these weights correspond to the log odds of the probability that the statement is true. (The log odds of
probability $p$ is $\log \frac{p}{1-p}$.) In general, these weights do not have a simple probabilistic interpretation [34].

Inference There are two main types of inference with MLNs: MAP (maximum a posteriori) inference, where
we want to find a most likely world, i.e., a world with the lowest cost, and marginal inference, where we want to 
compute the marginal probability of each unknown tuple. Both types of inference are essentially mathematical 
optimization problems that are intractable, and so existing MLN systems implement generic (search/sampling) 
algorithms for inference. As a baseline, Felix implements generic algorithms for both types of inference as 
well. Although Felix supports both types of inference in our decomposition architecture, in this work we focus 
on MAP inference to simplify the presentation. 
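
To make the cost function, the induced distribution, and the two inference types concrete, the following is a minimal brute-force sketch over a two-tuple world. It is only an illustration under stated assumptions: the first ground formula mirrors the running example (co-occurrence is evidence, so the rule reduces to pCoref implies affil with weight 4), while the other two formulae, their weights, and all function names are hypothetical. Exhaustive enumeration is used purely for clarity; it is not how Felix or any MLN system performs inference.

```python
import itertools
import math

# Hypothetical ground formulae over two unknown tuples: (weight, truth function).
# Only the first mirrors the running example; the other two are made up.
GROUND_FORMULAE = [
    (4.0, lambda w: (not w["pCoref"]) or w["affil"]),  # pCoref => affil
    (2.0, lambda w: w["pCoref"]),                      # hypothetical soft rule
    (-1.0, lambda w: w["affil"]),                      # hypothetical negative weight
]
UNKNOWN = ("pCoref", "affil")


def is_violated(weight, holds):
    # Positive weight: violated when false. Negative weight: violated when true.
    return (weight > 0 and not holds) or (weight < 0 and holds)


def cost(world):
    # Sum of |w(g)| over the ground formulae violated in this world.
    return sum(abs(wt) for wt, f in GROUND_FORMULAE if is_violated(wt, f(world)))


def all_worlds():
    return [dict(zip(UNKNOWN, vals))
            for vals in itertools.product((False, True), repeat=len(UNKNOWN))]


def probabilities():
    # Pr[I] = exp(-cost(I)) / Z, with Z summing over all possible worlds.
    worlds = all_worlds()
    scores = [math.exp(-cost(w)) for w in worlds]
    z = sum(scores)
    return [(w, s / z) for w, s in zip(worlds, scores)]


def map_world():
    # MAP inference: a world with the lowest cost.
    return min(all_worlds(), key=cost)


def marginal(tuple_name):
    # Marginal inference: total probability of worlds where the tuple is true.
    return sum(p for w, p in probabilities() if w[tuple_name])
```

In this toy setting the world with pCoref true and affil false pays exactly the cost of 4 described above, and `map_world()` returns the world in which both tuples are true.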

3.2 Lagrangian Relaxation 

We illustrate the basic idea of Lagrangian relaxation with a simple example. Consider the problem of minimizing
a real-valued function $f(x_1, x_2, x_3)$. Lagrangian relaxation is a technique that allows us to divide and conquer
a problem like this. For example, suppose that $f$ can be written as

$$f(x_1, x_2, x_3) = f_1(x_1, x_2) + f_2(x_2, x_3).$$

While we may be able to solve each of $f_1$ and $f_2$ efficiently, that ability does not directly lead to a solution to
$f$ since $f_1$ and $f_2$ share the variable $x_2$. However, we can rewrite $\min_{x_1, x_2, x_3} f(x_1, x_2, x_3)$ into the form

$$\min_{x_1, x_{21}, x_{22}, x_3} f_1(x_1, x_{21}) + f_2(x_{22}, x_3) \quad \text{s.t.} \quad x_{21} = x_{22},$$

where we essentially made two copies of $x_2$ and enforce that they are identical. The significance of such
rewriting is that we can apply Lagrangian relaxation to the equality constraint to decompose the formula into
two independent pieces. To do this, we introduce a scalar variable $\lambda \in \mathbb{R}$ (called a Lagrange multiplier) and
define

$$g(\lambda) = \min_{x_1, x_{21}, x_{22}, x_3} f_1(x_1, x_{21}) + f_2(x_{22}, x_3) + \lambda (x_{21} - x_{22}).$$

Then $\max_\lambda g(\lambda)$ is called the dual problem of the original minimization problem on $f$. Intuitively, the dual
problem trades off a penalty for how much the copies $x_{21}$ and $x_{22}$ disagree with the original objective value.
If the resulting solution of this dual problem is feasible for the original program (i.e., satisfies the equality
constraint), then this solution is also an optimum of the original program [51, p. 168].

The key benefit of such relaxation is that, instead of a single problem on $f$, we can now compute $g(\lambda)$ by
solving two independent problems (each problem is grouped by parentheses) that are hopefully (much) easier:

$$g(\lambda) = \left( \min_{x_1, x_{21}} f_1(x_1, x_{21}) + \lambda x_{21} \right) + \left( \min_{x_{22}, x_3} f_2(x_{22}, x_3) - \lambda x_{22} \right).$$

To compute $\max_\lambda g(\lambda)$, we can use standard techniques such as gradient descent [51, p. 174].
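
The decomposition above can be tried out on a tiny instance. The sketch below assumes binary variables and two made-up functions $f_1$ and $f_2$ (neither comes from the paper); it solves the two subproblems independently for a given $\lambda$ and moves $\lambda$ along the subgradient $x_{21} - x_{22}$ with a diminishing step size $1/k$, which is one common choice rather than the paper's prescribed schedule. It stops once the two copies of $x_2$ agree.

```python
import itertools

BITS = (0, 1)


def f1(x1, x2):
    # Hypothetical local objective; on its own it prefers x2 = 0.
    return 2 * x2 + (x1 != x2)


def f2(x2, x3):
    # Hypothetical local objective; on its own it prefers x2 = 1.
    return 3 * (1 - x2) + x3


def solve_sub1(lam):
    # argmin over (x1, x21) of f1(x1, x21) + lam * x21
    return min(itertools.product(BITS, BITS), key=lambda v: f1(*v) + lam * v[1])


def solve_sub2(lam):
    # argmin over (x22, x3) of f2(x22, x3) - lam * x22
    return min(itertools.product(BITS, BITS), key=lambda v: f2(*v) - lam * v[0])


def dual_decomposition(max_iters=50):
    lam = 0.0
    dual_value = None
    for k in range(1, max_iters + 1):
        x1, x21 = solve_sub1(lam)
        x22, x3 = solve_sub2(lam)
        dual_value = f1(x1, x21) + lam * x21 + f2(x22, x3) - lam * x22
        if x21 == x22:
            # The copies agree, so the relaxed solution is feasible for the original problem.
            return (x1, x21, x3), dual_value, lam
        # Subgradient step on the dual: move lam along (x21 - x22).
        lam += (1.0 / k) * (x21 - x22)
    return None, dual_value, lam


def brute_force():
    # Reference answer: minimize f1 + f2 directly over all assignments.
    return min(itertools.product(BITS, repeat=3),
               key=lambda v: f1(v[0], v[1]) + f2(v[1], v[2]))
```

On this instance the two subproblems initially disagree on $x_2$; after a few multiplier updates they agree, and the agreed assignment coincides with the brute-force minimizer. In general the copies need not reach agreement, which is why the sketch caps the number of iterations.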

Notice that Lagrangian relaxation could be used for MLN inference: consider the case where the $x_i$ are truth
values of database tuples representing a possible world $I$ and define $f$ to be $\mathrm{cost}_{\mathrm{mln}}(I)$ as in Equation 1. (Felix
can handle marginal inference with Lagrangian relaxation as well, but we focus on MAP inference to simplify
presentation.)

Figure 3: Execution Pipeline of Felix.



Decomposition Choices The Lagrangian relaxation technique leaves open the question of how to decompose 
a function / in general and introduce equality constraints. These are the questions we need to answer first and 
foremost if we want to apply Lagrangian relaxation to MLNs. Furthermore, it is important that we can scale 
up the execution of the decomposed program on large datasets. 



4 Architecture of Felix 

In this section, we provide an overview of the Felix architecture and some key concepts. We expand on further 
technical details in the next section. At a high level, the way Felix performs MLN inference resembles how 
an RDBMS performs SQL query evaluation: given an MLN program $\Gamma$, Felix transforms it in several phases
as illustrated in Figure 3. Felix first compiles an MLN program into a logical plan of tasks. Then, Felix
performs optimization (code selection) to select the best physical plan that consists of a sequence of statements 
that are then executed (by a process called the Master). In turn, the Master may call an RDBMS or statistical 
inference algorithms. 



4.1 Compilation 

In MLN inference, a variable of the underlying optimization problem corresponds to the truth value (for MAP 
inference) or marginal probability (for marginal inference) of a query relation tuple. While Lagrangian relaxation 
allows us to decompose an inference problem in arbitrary ways, Felix focuses on decompositions at the level 
of relations: Felix ensures that an entire relation is either shared between subproblems or exclusive to one 
subproblem. A key advantage of this is that Felix can benefit from the set-oriented processing power of an 
RDBMS. Even with this restriction, any partitioning of the rules in an MLN program $\Gamma$ is a valid decomposition.
(For the moment, assume that all rules are soft; we come back to hard rules in Section 4.3.)

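Formally, let $\Gamma = \{\phi_i\}$ be a set of MLN rules; denote by $\mathcal{R}$ the set of query relations and $x_R$ the set of
Boolean variables (i.e., unknown truth values) of $R \in \mathcal{R}$. Let $\Gamma_1, \ldots, \Gamma_k$ be a decomposition of $\Gamma$, and
$\mathcal{R}_i \subseteq \mathcal{R}$ the set of query relations referred to by $\Gamma_i$. Define $x_{\mathcal{R}_i} = \bigcup_{R \in \mathcal{R}_i} x_R$; similarly $x_{\mathcal{R}}$.
Then we can write the MLN cost function as

$$\min_{x_{\mathcal{R}}} \mathrm{cost}_{\mathrm{mln}}^{\Gamma}(x_{\mathcal{R}}) = \min_{x_{\mathcal{R}}} \sum_{i=1}^{k} \mathrm{cost}_{\mathrm{mln}}^{\Gamma_i}(x_{\mathcal{R}_i}).$$

To decouple the subprograms, we create a local copy of variables $x^i_{\mathcal{R}_i}$ for each $\Gamma_i$, but also introduce
Lagrangian multipliers $\lambda^j_R \in \mathbb{R}^{|x_R|}$ for each $R \in \mathcal{R}$ and each $\Gamma_j$ s.t. $R \in \mathcal{R}_j$, resulting in the dual problem

$$\max_{\lambda} g(\lambda) = \max_{\lambda} \left\{ \sum_{i=1}^{k} \min_{x^i_{\mathcal{R}_i}} \left[ \mathrm{cost}_{\mathrm{mln}}^{\Gamma_i}(x^i_{\mathcal{R}_i}) + \lambda^i_{\mathcal{R}_i} \cdot x^i_{\mathcal{R}_i} \right] \right\}$$

$$\text{subject to} \quad \sum_{j : R \in \mathcal{R}_j} \lambda^j_R = 0 \quad \forall R \in \mathcal{R}.$$

Thus, to perform Lagrangian relaxation on $\Gamma$, we need to augment the cost function of each subprogram
with the $\lambda^i_{\mathcal{R}_i} \cdot x^i_{\mathcal{R}_i}$ terms. As illustrated in the example below, these additional terms are equivalent to adding
singleton rules with the multipliers as weights. As a result, we can still solve the (augmented) subproblems
as MLN inference problems.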

Figure 4: An example logical plan. Relations in shaded boxes are evidence relations. Solid arrows indicate data
flow; dashed arrows are control.



Example 1 Consider a simple Markov Logic program $\Gamma$:

    1    GoodNews(p) => Happy(p)      $\phi_1$
    1    BadNews(p) => Sad(p)         $\phi_2$
    5    Happy(p) <=> ¬Sad(p)         $\phi_3$

where GoodNews and BadNews are evidence and the other two relations are queries. Consider the decomposition
$\Gamma_1 = \{\phi_1\}$ and $\Gamma_2 = \{\phi_2, \phi_3\}$. $\Gamma_1$ and $\Gamma_2$ share the relation Happy; so we create two copies of this relation:
Happy_1 and Happy_2, one for each subprogram. To relax the need that Happy_1 and Happy_2 be equal, we introduce
Lagrange multipliers $\lambda_p$, one for each possible tuple Happy(p). We thereby obtain a new program $\Gamma^\lambda$:

    1             GoodNews(p) => Happy_1(p)     $\phi'_1$
    $\lambda_p$   Happy_1(p)                    $\varphi_1$
    1             BadNews(p) => Sad(p)          $\phi_2$
    5             Happy_2(p) <=> ¬Sad(p)        $\phi_3$
    $-\lambda_p$  Happy_2(p)                    $\varphi_2$

This program contains two subprograms, $\Gamma_1^\lambda = \{\phi'_1, \varphi_1\}$ and $\Gamma_2^\lambda = \{\phi_2, \phi_3, \varphi_2\}$, that can be solved
independently.



The output of compilation is a logical plan that consists of a bipartite graph between a set of subprograms
(e.g., $\Gamma_1^\lambda$) and a set of relations (e.g., GoodNews and Happy). There is an edge between a subprogram and a
relation if the subprogram refers to the relation. In general, the decomposition could be either user-provided
or automatically generated. In Section 5.3, we discuss automatic decomposition.



4.2 Optimization 

The optimization stage fleshes out the logical plan with code selection and generates a physical plan with 
detailed statements that are to be executed by a process in Felix called the Master. Each subprogram 
in the logical plan is executed as a task that encapsulates a statistical algorithm that consumes and produces 
relations. The default algorithm assigned to each task is a generic MLN inference algorithm that can handle 
any MLN program [30]. However, as we will see in Section 5.1, there are several families of MLNs that have
specialized algorithms with high efficiency and high quality. For tasks matching those families, we execute them
with corresponding specialized algorithms.

The input/output relations of each task are not necessarily the relations in the logical plan. For example, 
the input to a classification task could be the results of some conjunctive queries translated from MLN rules. 
To model such indirection, we introduce data movement operators (DMOs), which are essentially datalog 
queries that map between MLN relations and task-specific relations. Roughly speaking, DMOs for specialized 
algorithms play a role that is similar to what grounding does for generic MLN inference. Given a task $\Gamma_i$, it
is the responsibility of the underlying algorithm to generate all necessary DMOs and register them with Felix.
Figure 4 shows an enriched logical plan after code selection and DMO generation. DMOs are critical to the
performance of Felix, and so we need to execute them efficiently. We observe that the overall performance of
an evaluation strategy for a DMO depends on not only how well an RDBMS can execute SQL, but also how
and how frequently a task queries this DMO, namely the access pattern of this task.

To expose the access patterns of a task to Felix, we model DMOs as adorned views [45]. In an adorned view,
each variable in the head of a view definition is associated with a binding-type, which is either b (bound) or f
(free). Given a DMO $Q$, denote by $\bar{x}^b$ (resp. $\bar{x}^f$) the set of bound (resp. free) variables in its head. Then we can
view $Q$ as a function mapping an assignment to $\bar{x}^b$ (i.e., a tuple) to a set of assignments to $\bar{x}^f$ (i.e., a relation).
Following the notation in Ullman [45], a query $Q$ of arity $a(Q)$ is written as $Q^{\alpha}(\bar{x})$ where $\alpha \in \{b, f\}^{a(Q)}$. By
default, all DMOs have the all-free binding pattern. But if a task exposes the access pattern of its DMOs,
Felix can select evaluation strategies of the DMOs in a more informed way: Felix employs a cost-based optimizer
for DMOs that takes advantage of both the RDBMS's cost-estimation facility and the data-access pattern of a
task (see Section 5.2).



Example 2 Say the subprogram $F_1$-$F_5$ in Figure 2 is executed as a task that performs coreference resolution
on pCoref, and Felix chooses the correlation clustering algorithm [3, 5] for this task. At this point, Felix
knows the data-access properties of that algorithm (which essentially asks only for "neighboring" elements).
Felix represents this using the following adorned view:

    DMO^{bf}(x, y) :- affil(x, o), affil(y, o), pSimSoft(x, y).

which is adorned as bf. During execution, this coref task sends requests such as x = 'Joe', and expects to receive
a set of names {y | DMO('Joe', y)}.
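
The difference between answering a bf-adorned DMO lazily and eagerly can be sketched with an in-memory SQLite database standing in for the RDBMS (Felix itself uses PostgreSQL). The table layouts, column names, sample rows, and helper functions below are all assumptions made for illustration; only the view body follows the DMO in Example 2.

```python
import sqlite3

# SQL body of the DMO: DMO(x, y) :- affil(x, o), affil(y, o), pSimSoft(x, y).
DMO_BODY = """
    SELECT DISTINCT a1.person AS x, a2.person AS y
    FROM affil AS a1
    JOIN affil AS a2 ON a1.org = a2.org
    JOIN pSimSoft AS s ON s.p1 = a1.person AND s.p2 = a2.person
"""


def setup():
    # Hypothetical sample data; a real deployment would hold millions of tuples.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE affil(person TEXT, org TEXT);
        CREATE TABLE pSimSoft(p1 TEXT, p2 TEXT);
        INSERT INTO affil VALUES
            ('Joe', 'UW'), ('Joseph', 'UW'), ('Jo', 'MIT'),
            ('Ann', 'MIT'), ('Anne', 'MIT');
        INSERT INTO pSimSoft VALUES
            ('Joe', 'Joseph'), ('Joe', 'Jo'), ('Ann', 'Anne');
    """)
    return db


def lazy_lookup(db, x):
    # Lazy: evaluate the view on demand with the bound variable x pushed in.
    sql = f"SELECT y FROM ({DMO_BODY}) AS v WHERE v.x = ? ORDER BY y"
    return [row[0] for row in db.execute(sql, (x,))]


def materialize(db):
    # Eager: compute the whole view once and index it on the bound attribute.
    db.execute(f"CREATE TABLE dmo AS {DMO_BODY}")
    db.execute("CREATE INDEX dmo_x ON dmo(x)")


def eager_lookup(db, x):
    rows = db.execute("SELECT y FROM dmo WHERE x = ? ORDER BY y", (x,))
    return [row[0] for row in rows]
```

Both strategies return the same answers; they differ in when the join work is done. A task that asks for only a few values of x favors the lazy form, while a task that probes most values may favor paying the one-time materialization cost.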

Felix can sometimes deduce from the DMOs how a task may be parallelized (e.g., via key attributes), and
takes advantage of such opportunities. The output of optimization is a DAG of statements. Statements are
of two forms: (1) a prepared SQL statement; (2) a statement encoding the necessary information to run a task
(e.g., the number of iterations an algorithm should run, data locations, etc.).



4.3 Execution 

In Felix, a process called the Master coordinates the tasks by periodically updating the Lagrangian multiplier
associated with each shared tuple (e.g., in Example 1). Such an iterative updating scheme is called master-slave
message passing. The goal is to optimize $\max_{\lambda} g(\lambda)$ using standard subgradient methods [51, p. 174].
Specifically, let $p$ be an unknown tuple of $R$; then at step $k$ the Master updates each $\lambda_p^i$ s.t. $R \in \mathcal{R}_i$ using the
following rule:

$$\lambda_p^i \leftarrow \lambda_p^i + \alpha_k \left( x_p^i - \frac{\sum_{j : R \in \mathcal{R}_j} x_p^j}{|\{j : R \in \mathcal{R}_j\}|} \right),$$

where $\alpha_k$ is the gradient step size for this update. A key novelty of Felix is that we can leverage the underlying
RDBMS to efficiently compute the gradient on an entire relation. To see why, let $\lambda_p^j$ be the multipliers for a
shared tuple $p$ of a relation $R$; $\lambda_p^j$ is stored as an extra attribute in each copy $j$ of $R$. Note that at each iteration,
$\lambda_p^j$ changes only if the copies of $R$ do not agree on $p$ (e.g., exactly one copy has $p$ missing). Thus, we can update
all $\lambda_p^j$'s with an outer join between the copies of $R$ using SQL. The subgradient procedure stops either
when all copies have reached an agreement (or only a very small portion disagrees) or when Felix has run a
pre-specified maximum number of iterations.
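
The update rule can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not the SQL outer join Felix runs: each copy of R is modeled as an in-memory set, $x_p^j$ is assumed to be the 0/1 indicator that copy j contains tuple p, and the function name `update_multipliers` is hypothetical.

```python
def update_multipliers(copies, lam, step):
    """One subgradient step for a shared relation R.

    copies: list of sets; copies[j] holds the tuples in task j's copy of R.
    lam:    list of dicts; lam[j][p] is the multiplier for tuple p in copy j.
    step:   the step size alpha_k for this iteration.
    """
    n = len(copies)
    for p in set().union(*copies):
        x = [1.0 if p in c else 0.0 for c in copies]  # assumed indicator x_p^j
        mean = sum(x) / n
        for j in range(n):
            delta = x[j] - mean  # zero when all copies agree on p
            if delta:
                lam[j][p] = lam[j].get(p, 0.0) + step * delta
    return lam

# Two copies disagree on tuple "b" only, so only its multipliers move.
lam = update_multipliers([{"a", "b"}, {"a"}], [{}, {}], step=0.5)
```

Tuples on which all copies agree produce a zero update, which is why the computation only needs to touch the disagreeing rows of the join.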



Scheduling and Parallelism Between two iterations of message passing, each task is executed until completion.
If these tasks run sequentially (say due to limited RAM or CPU), then any order of execution would result
in the same run time. On the other hand, if all tasks can run in parallel, then faster tasks would have to wait for
the slowest task to finish before message passing could proceed. To better utilize CPU time, Felix updates the
Lagrangian multipliers for a shared relation R as soon as all involved tasks have finished. Furthermore, a task
is restarted as soon as all shared relations of this task have been updated. If computation resources are abundant,
Felix also considers parallelizing a task.






Task                         Implementation
Simple Classification        Linear models [7]
Correlated Classification    Conditional Random Fields [24]
Coreference                  Correlation clustering [3, 5]



Table 1: Example specialized tasks and their implementations in Felix. 



Initialization and Finalization Let $\sigma = T_1, \ldots, T_n$ be a sequence of all tasks obtained by a breadth-first
traversal of the logical plan. At initial execution time, to bootstrap from the initial empty state, we sequentially
execute the tasks in the order of $\sigma$, each task initializing its local copies of a relation by copying from the output
of previous tasks. Then Felix performs the above master-slave message-passing scheme for several iterations;
during this phase all tasks can run in parallel. At the end of execution, we perform a finalization step: we
traverse $\sigma$ again and output the copy from $T_R^{\mathrm{last}}$ for each query relation $R$, where $T_R^{\mathrm{last}}$ is the last task in $\sigma$ that
outputs $R$. To ensure that hard rules in the input MLN program are not violated in the final output, we insist
that for any query relation $R$, $T_R^{\mathrm{last}}$ respects all hard rules involving $R$. (We allow hard rules to be assigned to
multiple tasks.) This guarantees that the output of the finalization step is a possible world for $\Gamma$ (provided that
the hard rules are satisfiable).



5 Technical Details 

Having set up the general framework, in this section, we discuss further technical challenges and solutions
in Felix. First, as each individual task might be as complex as the original MLN, decomposition by itself
does not automatically lead to high scalability. To address this issue, we identify several common statistical
tasks with well-studied algorithms and characterize their correspondence with MLN subprograms (Section 5.1).
Second, even when each individual task is able to run efficiently, sometimes the data movement cost may be
prohibitive. To address this issue, we propose a novel cost-based materialization strategy for data movement
operators (Section 5.2). Third, since the user may not be able to provide a good task decomposition scheme, it
is important for Felix to be able to compile an MLN program into tasks automatically. To support this, we
describe the compiler of Felix that automatically recognizes specialized tasks in an MLN program (Section 5.3).



5.1 Specialized Tasks 

By default, Felix solves a task (which is also an MLN program) with a generic MLN inference algorithm based 
on a reduction to MaxSAT [22], which is designed to solve sophisticated MLN programs. Ideally, when a task 
has certain properties indicating that it can be solved using a more efficient specialized algorithm, Felix should 
do so. Conceptually, the Felix framework supports all statistical tasks that can be modeled as mathematical 
programs. As an initial proof of concept, our prototype of Felix integrates two statistical tasks that are widely 
used in text applications: classification and coreference (see Table 1). These specialized tasks are well-studied
and so have algorithms with high efficiency and high quality. 



Classification Classification tasks are ubiquitous in text applications; e.g., classifying documents by topics or 
sentiments, and classifying noun phrases by entity types. In a classification task, we are given a set of objects 
and a set of labels; the goal is to assign a label to each object. Depending on the structure of the cost function, 
there are two types of classification tasks: simple classification and correlated classification. 

In simple classification, given a model, the assignment of each object to a label is independent of other
object labels. We describe a Boolean classification task for simplicity, i.e., our goal is to determine whether each
object is in or out of a single class. The input to a Boolean classification task is a pair of relations: the model,
which can be viewed as a relation $M(f, w)$ that maps each feature $f$ to a single weight $w \in \mathbb{R}$, and a relation
of objects $I(o, f)$; if a tuple $(o, f)$ is in $I$ then object $o$ has feature $f$ (otherwise not). The output is a relation
$R(o)$ that indicates which objects are members of the class ($R$ can also contain their marginal probabilities).
For simple classification, the optimal $R$ can be populated by including those objects $o$ such that

$$\sum_{w \,:\, M(f, w) \text{ and } I(o, f)} w > 0.$$

One can implement a simple classification task with SQL aggregates, which should be much more efficient than
the MaxSAT algorithm used in generic MLN inference.
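
As an illustration of the SQL-aggregate formulation, here is a minimal sketch using SQLite from Python. The relations M and I follow the definitions above; the sample features, weights, and objects are made up, and infinite weights (hard rules) are not handled in this toy version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE M(f TEXT PRIMARY KEY, w REAL);   -- model: feature -> weight
    CREATE TABLE I(o TEXT, f TEXT);               -- object o has feature f
    INSERT INTO M VALUES ('f1', 2.0), ('f2', -3.0), ('f3', 0.5);
    INSERT INTO I VALUES ('a', 'f1'), ('a', 'f3'),
                         ('b', 'f1'), ('b', 'f2'),
                         ('c', 'f3');
""")

# R(o): objects whose summed feature weights are positive.
CLASSIFY = """
    SELECT I.o
    FROM I JOIN M ON M.f = I.f
    GROUP BY I.o
    HAVING SUM(M.w) > 0
    ORDER BY I.o
"""
R = [row[0] for row in conn.execute(CLASSIFY)]
```

Here object a scores 2.5, b scores -1.0, and c scores 0.5, so only a and c are placed in the class; the whole task is a single join and group-by.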

The twist in Felix is that the objects and the features of the model are defined by MLN rules. For example,
the rules $F_6$ and $F_7$ in Figure 2 form a classification task that determines whether each affil tuple (considered
as an object) holds. Said another way, each rule is a feature. So, Felix populates the model relation $M$
with two tuples: $M(F_6, +\infty)$ and $M(F_7, 8)$, and populates the input relation $I$ by executing the conjunctive
queries in $F_6$ and $F_7$; e.g., from $F_7$ Felix generates tuples of the form $I(P, O, F_7)$, which indicates that the
object affil(P, O) has the feature $F_7$. Operationally, Felix performs such translation via DMOs that are also
adorned with the task's access patterns; e.g., the DMO for $I$ has the adornment $I^{bbf}$ since Felix classifies each
affil(P, O) independently.

Felix extends this basic model in two ways: (1) Felix implements multi-class classification by adding a
class attribute to $M$ and $I$. (2) Felix also supports correlated classification: in addition to per-object features,
Felix also allows features that span multiple objects. For example, in named entity recognition, if we see the
token "Mr." then the next token is very likely to be a person's name. In general, one can form a graph where the
nodes are objects and two objects are connected if there is a rule that refers to both objects. When this graph
is acyclic, the task essentially consists of tree-structured CRF models that can be solved in polynomial time
with dynamic programming algorithms [24].

Coreference Another common task is coreference resolution (coref): e.g., given a set of strings (say phrases in
a document), we want to decide which strings represent the same real-world entity. These tasks are ubiquitous
in text processing. The input to a coref task is a single relation $B(o_1, o_2, wgt)$ where $wgt = \beta_{o_1,o_2} \in \mathbb{R}$ indicates
how likely the objects $o_1, o_2$ are coreferent (with 0 being neutral). The output of a coref task is a relation
$R(o_1, o_2)$ that indicates which pairs of objects are coreferent; $R$ is an equivalence relation, i.e., it satisfies
reflexivity, symmetry, and transitivity. Assuming that $\beta_{o_1,o_2} = 0$ if $(o_1, o_2)$ is not in the key set of the relation
$B$, each valid $R$ incurs a cost (called disagreement cost)

$$\mathrm{cost}_{\mathrm{coref}}(R) = \sum_{\substack{o_1, o_2 \,:\, (o_1, o_2) \notin R \\ \text{and } \beta_{o_1,o_2} > 0}} |\beta_{o_1,o_2}| \;+ \sum_{\substack{o_1, o_2 \,:\, (o_1, o_2) \in R \\ \text{and } \beta_{o_1,o_2} < 0}} |\beta_{o_1,o_2}|.$$

The goal of coref is to find a relation with the minimum cost:

$$R^* = \arg\min_{R} \mathrm{cost}_{\mathrm{coref}}(R).$$

Coreference resolution is a well-studied problem [5, 16]. The underlying inference problem is NP-hard in almost
all variants. As a result, there is a literature on approximation techniques (e.g., correlation clustering [3, 5]).
Felix implements these algorithms for coreference tasks. In Figure 2, $F_1$ through $F_5$ constitute a coref task
for the relation pCoref. $F_1$ through $F_3$ encode the reflexivity, symmetry, and transitivity properties of pCoref,
and $F_4$ and $F_5$ essentially define the weights on the edges (similar to Arasu et al. [5]) from which Felix constructs
the relation $B$ (via DMOs).
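
The disagreement cost can be computed directly from a clustering, as in the sketch below. The greedy pivot heuristic shown alongside it is a simplified, deterministic stand-in for the (randomized) correlation clustering algorithms cited above, included only to show how an approximate R might be produced; the function names, the use of unordered pairs, and the sample weights are assumptions.

```python
def coref_cost(beta, clusters):
    """Disagreement cost of a clustering.

    beta: dict mapping frozenset({o1, o2}) of two distinct objects to a
          weight; pairs that are absent are treated as weight 0.
    clusters: list of sets partitioning the objects.
    """
    label = {o: i for i, c in enumerate(clusters) for o in c}
    cost = 0.0
    for pair, w in beta.items():
        o1, o2 = tuple(pair)
        together = label[o1] == label[o2]
        # Pay |w| for a positive pair that is split or a negative pair merged.
        if (w > 0 and not together) or (w < 0 and together):
            cost += abs(w)
    return cost

def pivot_clustering(objects, beta):
    """Greedy pivot heuristic: take the next unclustered object as pivot and
    group it with every remaining object it has a positive weight to."""
    remaining = list(objects)
    clusters = []
    while remaining:
        pivot = remaining.pop(0)
        cluster = {pivot}
        for o in list(remaining):
            if beta.get(frozenset((pivot, o)), 0.0) > 0:
                cluster.add(o)
                remaining.remove(o)
        clusters.append(cluster)
    return clusters

beta = {frozenset(("a", "b")): 2.0, frozenset(("b", "c")): 1.0,
        frozenset(("a", "c")): -0.5, frozenset(("c", "d")): 3.0}
clusters = pivot_clustering(["a", "b", "c", "d"], beta)
cost = coref_cost(beta, clusters)
```

On this toy input the heuristic returns {a, b} and {c, d} with cost 1.0, whereas merging everything would cost 0.5, which illustrates that such heuristics are approximate. Note that the pivot only asks for the neighbors of one object at a time, matching the bf access pattern of Example 2.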

4 In general a model usually has both positive and negative features.

5.2 Optimizing Data Movement Operators

Recall that data are passed between tasks and the RDBMS via data movement operators (DMOs). While
the statistical algorithm inside a task may be very efficient (Section 5.1), DMO evaluation could be a major
scalability bottleneck. An important goal of Felix's optimization stage is to decide whether and how to
materialize DMOs. For example, a baseline approach would be to materialize all DMOs. While this is a
reasonable approach when a task repeatedly queries a DMO with the same parameters, in some cases, the
result may be so large that an eager materialization strategy would exhaust available disk space. For example,
on an Enron dataset, materializing the following DMO would require over 1TB of disk space:

$$\mathrm{DMO}^{bb}(x, y) \leftarrow \mathrm{mention}(x, name_1),\ \mathrm{mention}(y, name_2),\ \mathrm{mayref}(name_1, z),\ \mathrm{mayref}(name_2, z).$$



Moreover, some specialized tasks may inspect only a small fraction of their search space, and so such eager
materialization is inefficient. For example, one implementation of the coref task is a stochastic algorithm that
examines a number of data items roughly linear in the number of nodes (even though the input to coref contains
a quadratic number of pairs of nodes) [5]. In such cases, it seems more reasonable to simply declare the DMO as a regular
database view (or prepared statement) that is to be evaluated lazily during execution.

Felix is, however, not confined to fully eager or fully lazy strategies. In Felix, we have found that intermediate
points (e.g., materializing a subquery of a DMO $Q$) can yield dramatic speed improvements (see Section 6.4).
To choose among materialization strategies, Felix takes hints from the tasks: Felix allows a task to expose
its access patterns, including both an adornment $Q^{\alpha}$ (see Section 4.2) and an estimated number of accesses $t$
on $Q$. (Operationally, $t$ could be a Java function or SQL query to be evaluated against the base relations of
$Q$.) Those parameters, together with the cost-estimation facility of the underlying RDBMS (here, PostgreSQL),
enable a System-R-style cost-based optimizer in Felix that explores all possible materialization strategies using
the following cost model.



Felix Cost Model To define our cost model, we introduce some notation. Let $Q^{\alpha}(x) \leftarrow g_1, g_2, \ldots, g_k$ be a
DMO. Let $G = \{g_i \mid 1 \le i \le k\}$ be the set of subgoals of $Q$. Let $\mathcal{G} = \{G_1, \ldots, G_m\}$ be a partition of $G$; i.e.,
$G_j \subseteq G$, $G_i \cap G_j = \emptyset$ for all $i \neq j$, and $\bigcup_j G_j = G$. Intuitively, a partition represents a possible materialization
strategy: each element of the partition represents a query (or simply a relation) that Felix is considering
materializing. That is, the case of a single $G_1 = G$ corresponds to a fully eager strategy. The case where all $G_i$ are
singleton sets corresponds to a lazy strategy.

More precisely, define $Q_j(x_j) \leftarrow G_j$ where $x_j$ is the set of variables in $G_j$ shared with $x$ or any other $G_i$ for
$i \neq j$. Then, we can implement the DMO with a regular database view $Q'(x) \leftarrow Q_1, \ldots, Q_m$. Let $t$ be the total
number of accesses on $Q'$ performed by the statistical task. We model the execution cost of a materialization
strategy as:

$$\mathrm{ExecCost}(Q', t) = t \cdot \mathrm{Inc}_{\alpha}(Q') + \sum_{i=1}^{m} \mathrm{Mat}(Q_i).$$

$\mathrm{Mat}(Q_i)$ is the cost of eagerly materializing $Q_i$ and $\mathrm{Inc}_{\alpha}(Q')$ is the estimated cost of each query to $Q'$ with
adornment $\alpha$.

A significant implementation detail is that since the subgoals in $Q'$ are not actually materialized, we cannot
directly ask PostgreSQL for the incremental cost $\mathrm{Inc}_{\alpha}(Q')$. In our prototype version of Felix, we implement
a simple approximation of PostgreSQL's optimizer (that assumes incremental plans use only index-nested-loop
joins), and so our results should be taken as a lower bound on the performance gains that are possible when
materializing one or more subqueries. We provide more details on this approximation in Section C.3. Although
the number of possible plans is exponential in the size of the largest rule in an input Markov Logic program, in
our applications the individual rules are small. Thus, we can estimate the cost of each alternative, and we pick
the one with the lowest ExecCost.
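
The exhaustive search over materialization strategies can be sketched as follows. This is an illustration only: the functions `mat_cost` and `inc_cost` are toy stand-ins (assumptions) for the RDBMS-derived estimates Mat and Inc, and the names `partitions` and `best_strategy` are hypothetical.

```python
def partitions(items):
    """Yield every partition of a list as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def best_strategy(subgoals, t, mat_cost, inc_cost):
    """Return (cost, partition) minimizing t * Inc(Q') + sum_i Mat(Q_i)."""
    best = None
    for part in partitions(subgoals):
        cost = t * inc_cost(part) + sum(mat_cost(block) for block in part)
        if best is None or cost < best[0]:
            best = (cost, part)
    return best

# Toy cost functions: singleton blocks are base relations (free to "materialize"),
# larger blocks cost 50 per join; each access pays 1 per remaining join.
mat = lambda block: 0 if len(block) == 1 else 50 * (len(block) - 1)
inc = lambda part: len(part) - 1

few_accesses = best_strategy(["g1", "g2", "g3"], t=1, mat_cost=mat, inc_cost=inc)
many_accesses = best_strategy(["g1", "g2", "g3"], t=1000, mat_cost=mat, inc_cost=inc)
```

With a single access the lazy (all-singleton) strategy wins, while with a thousand accesses the fully eager strategy wins; intermediate partitions become optimal under less uniform cost functions. The enumeration is exponential in the number of subgoals, which is acceptable only because individual rules are small.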
Tree Recursive    See Equation (2)





Table 2: Properties assigned to predicates by the Felix compiler. KEY refers to a non-trivial key. Recursive 
properties are derived from all rules; the other properties are derived from hard rules. 



Task                         Required Properties
Simple Classification        KEY, NoREC
Correlated Classification    KEY, TrREC
Coref                        REF, SYM, TRN
Generic MLN Inference        none



Table 3: Tasks and their required properties. 



5.3 Automatic Compilation 

So far we have assumed that the mappings between MLN rules, tasks, and algorithms are all specified by
the user. However, ideally a compiler should be able to automatically recognize subprograms that could be
processed as specialized tasks. In this section we describe a best-effort compiler that is able to automatically
detect the presence of classification and coref tasks. To decompose an MLN program $\Gamma$ into tasks, Felix
uses a two-step approach. Felix's first step is to annotate each query predicate $p$ with a set of properties. An
example property is whether or not $p$ is symmetric. Table 2 lists the set of properties that Felix attempts to
discover with their definitions; NoREC and TrREC are rule-specific. Once the properties are found, Felix uses
Table 3 to list all possible options for a predicate. When there are multiple options, the current prototype of
Felix simply chooses the first task to appear in the following order: (Coref, Simple Classification, Correlated
Classification, Generic). This order intuitively favors more specific tasks. To compile an MLN into tasks,
Felix greedily applies the above procedure to split a subset of rules into a task, and then iterates until all
rules have been consumed. As shown below, property detection is non-trivial as the predicates are the output
of SQL queries (or formally, datalog programs). Therefore, Felix implements a best-effort compiler using a
set of syntactic patterns; this compiler is sound but not complete. It is interesting future work to design more
sophisticated compilers for Felix.
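
The second step, choosing a task from a predicate's detected properties, amounts to a lookup in Table 3 followed by the stated priority order. A minimal sketch (the function name `choose_task` and the dictionary encoding are assumptions; the property sets and the order are taken from the text):

```python
# Required properties per task, transcribed from Table 3.
REQUIRED = {
    "Coref": {"REF", "SYM", "TRN"},
    "Simple Classification": {"KEY", "NoREC"},
    "Correlated Classification": {"KEY", "TrREC"},
    "Generic MLN Inference": set(),
}

# Priority order stated in the text: more specific tasks first.
PRIORITY = ["Coref", "Simple Classification",
            "Correlated Classification", "Generic MLN Inference"]

def choose_task(properties):
    """Return the first task in PRIORITY whose requirements are all present."""
    properties = set(properties)
    for task in PRIORITY:
        if REQUIRED[task] <= properties:
            return task

task_for_pcoref = choose_task({"REF", "SYM", "TRN"})
```

Because generic MLN inference requires no properties, the lookup always succeeds, which matches the fallback behavior described above.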

Detecting Properties The most technically difficult part of the compiler is determining the properties of
the predicates (cf. [14]). There are two types of properties that Felix looks for: (1) schema-like properties of
any possible worlds that satisfy $\Gamma$ and (2) graphical structures of correlations between tuples. For both types of
properties, the challenge is that we must infer these properties from the underlying rules applied to an infinite
number of databases. For example, SYM is the property:

"for any database $I$ that satisfies $\Gamma$, does the sentence $\forall x, y.\ \mathrm{pCoref}(x, y) \iff \mathrm{pCoref}(y, x)$ hold?"

Since $I$ comes from an infinite set, it is not immediately clear that the property is even decidable. Indeed, REF
and SYM are not decidable for Markov Logic programs.

Although the set of properties in Table 2 is motivated by considerations from statistical inference, the first
four properties depend only on the hard rules in $\Gamma$, i.e., the constraints and (SQL-like) data transformations
in the program. Let $\Gamma_\infty$ be the set of rules in $\Gamma$ that have infinite weight. We consider the case when $\Gamma_\infty$ is
written as a datalog program.

5 PostgreSQL does not fully support "what-if" queries, although other RDBMSs do, e.g., for index tuning.

6 As is standard in database theory [2], to model the fact that the query compiler runs without examining the data, we consider the domain of
the attributes to be unbounded. If the domain of each attribute is known, then all of the above properties are decidable by the trivial algorithm
that enumerates all (finitely many) instances.

Theorem 5.1. Given a datalog program $\Gamma_\infty$, a predicate $p$, and a property $\theta$, deciding if for all input databases
$p$ has property $\theta$ is undecidable if $\theta \in \{\mathrm{REF}, \mathrm{SYM}\}$.

The above result is not surprising, as datalog is a powerful language and containment is undecidable [2, ch. 12]
(the proof reduces from containment). Moreover, the compiler is related to implication problems studied by
Abiteboul and Hull (who also establish that generalizations of the KEY and TRN problems are undecidable [1]).
NoREC is the negation of the boundedness problem [10], which is undecidable.

In many cases, recursion is not used in $\Gamma_\infty$ (e.g., $\Gamma_\infty$ may consist of standard SQL queries that transform the
data), and so a natural restriction is to consider $\Gamma_\infty$ without recursion, i.e., as a union of conjunctive queries.

Theorem 5.2. Given a union of conjunctive queries $\Gamma_\infty$, deciding if for all input databases that satisfy $\Gamma_\infty$ the
query predicate $p$ has property $\theta$ where $\theta \in \{\mathrm{REF}, \mathrm{SYM}\}$ (Table 2) is decidable. Furthermore, the problem is
$\Pi_2^P$-complete. KEY and TRN are trivially false. NoREC is trivially true.

Still, Felix must annotate predicates with properties. To cope with the undecidability and intractability
of finding compiler annotations, Felix uses a set of sound (but not complete) rules that are described by
simple patterns. For example, we can conclude that a predicate $R$ is transitive if the program syntactically contains
the rule $R(x, y), R(y, z) \rightarrow R(x, z)$ with weight $\infty$.
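
A syntactic pattern of this kind can be sketched with a regular expression. The textual rule syntax (`R(x, y), R(y, z) => R(x, z)`), the (text, weight) representation, and the function name are all assumptions made for illustration; Felix's actual rule format and patterns are not shown in this section.

```python
import re

# Hypothetical textual rule syntax: "R(x, y), R(y, z) => R(x, z)".
# Backreferences force the same predicate name and the right variable chaining.
TRANSITIVE = re.compile(
    r"^\s*(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*,"   # R(x, y),
    r"\s*\1\(\s*\3\s*,\s*(\w+)\s*\)\s*=>"          # R(y, z) =>
    r"\s*\1\(\s*\2\s*,\s*\4\s*\)\s*$"              # R(x, z)
)

def transitive_predicates(rules):
    """rules: iterable of (rule_text, weight); hard rules have weight inf.

    Returns the predicates that are syntactically declared transitive.
    The check is sound but incomplete: equivalent rules written differently
    (e.g., with the body atoms swapped) are not recognized.
    """
    found = set()
    for text, weight in rules:
        m = TRANSITIVE.match(text)
        if m and weight == float("inf"):
            x, y, z = m.group(2), m.group(3), m.group(4)
            if len({x, y, z}) == 3:  # require three distinct variables
                found.add(m.group(1))
    return found

rules = [
    ("pCoref(x, y), pCoref(y, z) => pCoref(x, z)", float("inf")),
    ("affil(a, b), affil(b, c) => affil(a, c)", 2.0),      # soft rule: ignored
    ("sim(x, y), sim(z, y) => sim(x, z)", float("inf")),   # wrong shape
]
trn = transitive_predicates(rules)
```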

Ground Structure The second type of properties that Felix considers characterize the graphical structure
of the ground database (in turn, this structure describes the correlations that must be accounted for in the
inference process). We assume that $\Gamma$ is written as a datalog program (with stratified negation). The ground
database is a function of both soft and hard rules in the input program, and so we consider both types of rules
here. Felix's compiler attempts to deduce a special case of recursion that is motivated by (tree-structured)
conditional random fields that we call TrREC. Suppose that there is a single recursive rule that contains $p$ in
both the body and the head and is of the form:

$$p(x, y),\ T(y, z) \Rightarrow p(x, z) \qquad (2)$$

where the first attribute of $T$ is a key and the transitive closure of $T$ is a partial order. In the ground database,
$p$ will be "tree-structured". MAP and marginal inference for such rules are in P-time [40, 46]. Felix has a
regular expression to deduce this property.



6 Experiments 

Although MLN inference has a wide range of applications, we focus on knowledge-base construction tasks. In 
particular, we use Felix to implement the TAC-KBP challenge; Felix is able to scale to the 1.8M-document 
corpus and produce results with state-of-the-art quality. In contrast, prior (monolithic) approaches to MLN 
inference crash even on a subset of KBP that is orders of magnitude smaller. 

In Section 6.1, we compare the overall scalability and quality of Felix with prior MLN inference approaches
on four datasets (including KBP). We show that, when prior MLN systems run, Felix is able to produce similar
results but more efficiently; when prior MLN systems fail to scale, Felix can still generate high-quality results.
In Section 6.2, we demonstrate that the message-passing scheme in Felix can effectively reconcile conflicting
predictions and has stable convergence behavior. In Section 6.3, we show that specialized tasks and algorithms
are critical for Felix's high performance and scalability. In Section 6.4, we validate that the cost-based DMO
optimization is crucial to Felix's efficiency.



Datasets and Applications Table 4 lists some statistics about the four datasets that we use for experiments:
(1) KBP is a 1.8M-document corpus from TAC-KBP; the task is to perform two related tasks: a) entity linking:
extract all entity mentions and map them to entries in Wikipedia, and b) slot filling: determine (tens of types
of) relationships between entities. There is also a set of ground truths over a 2K-document subset (call it KBP-R)
that we use for quality assessment. (2) NFL, where the task is to extract football game results (winners
and losers) from sports news articles. (3) Enron, where the task is to identify person mentions and associated
phone numbers in the Enron email dataset. There are two versions of Enron: Enron is the full dataset;
Enron-R is a 680-email subset on which we manually annotated person-phone ground truth. We use Enron
for performance evaluation, and Enron-R for quality assessment. (4) DBLife, where the task is to extract
persons, organizations, and affiliation relationships between them from a collection of academic webpages. For
DBLife, we use the ACM author profile data as ground truth.

                     2.5M
DBLife     22K       700K
NFL        1.1K      100K

Table 4: Statistics of input data. Note that MLN inference generates much larger intermediate data.



MLN Programs For KBP, we developed MLN programs that fuse a wide array of data sources including
NLP results, Web search results, Wikipedia links, Freebase, etc. For performance experiments, we use our entity
linking program (which is more sophisticated than slot filling). The MLN program on NFL has a conditional
random field model as a component, with some additional common-sense rules (e.g., "a team cannot be both
a winner and a loser on the same day") that are provided by another research project. To expand our set of
MLN programs, we also create MLNs on Enron and DBLife by adapting rules in state-of-the-art rule-based
IE approaches [12, 25]: each rule-based program is essentially equivalent to an MLN-based program (without
weights). We simply replace the ad hoc reasoning in these deterministic rules with a simple statistical variant.
For example, the DBLife program in Cimple [12] says that if a person and an organization co-occur with some
regular expression context then they are affiliated, and ranks relationships by frequency of such co-occurrences.
In the corresponding MLN we have several rules for several types of co-occurrences, and ranking is by marginal
probabilities.



Experimental Setup To compare with alternative implementations of MLNs, we consider two state-of-the-art
MLN implementations: (1) Alchemy, the reference implementation for MLNs [13], and (2) Tuffy, an
RDBMS-based implementation of MLNs [30]. Alchemy is implemented in C++. Tuffy and Felix are both
implemented in Java and use PostgreSQL 9.0.4. Felix uses Tuffy as a task. Unless otherwise specified, all
experiments are run on a RHEL5 workstation with two 2.67GHz Intel Xeon CPUs (24 total cores), 24 GB of
RAM, and over 200GB of free disk space.



6.1 High-level Scalability and Quality 

We empirically validate that Felix achieves higher scalability and essentially identical result quality compared
to prior monolithic approaches. To support these claims, we compare the performance and quality of different
MLN inference systems (Tuffy, Alchemy, and Felix) on the datasets listed above: KBP, Enron, DBLife,
and NFL. In all cases, Felix runs its automatic compiler; parameters (e.g., gradient step sizes, generic inference
parameters) are held constant across datasets. Tuffy and Alchemy have two sequential phases in their run
time: grounding and search; results are produced only in the search phase. A system is deemed unscalable if it
fails to produce any inference results within 6 hours. The overall scalability results are shown in Table 5.

7 http://bailando.sims.berkeley.edu/enron_email.html

8 http://www.cs.cmu.edu/~einat/datasets.html

9 http://dblife.cs.wisc.edu
Table 5: Scalability of various MLN systems.

Figure 5: High-level quality results of various MLN systems. For each dataset, we plot a precision-recall curve
of each system by varying k in top-k results; missing curves indicate that a system does not scale on the
corresponding dataset.



Quality Assessment We perform quality assessment on four datasets: KBP-R, NFL, Enron-R, and DBLife.
On each dataset, we run each MLN system for 4000 seconds with marginal inference. (After 4000 seconds, the
quality of each system has stabilized.) For KBP-R, we convert the output to TAC's query-answer format and
compute the F1 score against the ground truth. For the other three datasets, we draw precision-recall curves:
we take ranked lists of predictions from each system and measure precision/recall of the top-k results while
varying the number of answers returned. The quality of each system is shown in Figure 5. System-dataset
pairs that do not scale have no curves.



KBP & NFL Recall that there are two tasks in KBP: entity linking and slot filling. On both tasks, Felix
is able to scale to the 1.8M documents and, after running about 5 hours on a 30-node parallel RDBMS, produce
results with state-of-the-art quality [19]. We achieved an F1 score of 0.80 on entity linking (human annotators'
performance is 0.90), and an F1 score of 0.34 on slot filling (state-of-the-art quality). In contrast, Tuffy and
Alchemy crashed even on the KBP-R subset, which is three orders of magnitude smaller. Although also based on an
RDBMS, Tuffy attempted to generate about $10^{11}$ and $10^{14}$ tuples on KBP-R and KBP, respectively.

To assess the quality of Felix as compared to monolithic inference, we also run the three MLN systems
on NFL. Both Felix and Tuffy scale on the NFL data set, and as shown in Figure 5, produce results with
similar quality. However, Felix is an order of magnitude faster: Tuffy took about an hour to start outputting
results, whereas Felix's quality converges after only five minutes. We validated that the reason is that Tuffy
was not aware of the linear correlation structure of a classification task in the NFL program, and ran generic
MLN inference in an inefficient manner.



Enron & DBLife To expand our test cases, we consider two more datasets, Enron-R and DBLife, to
evaluate the key question we try to answer: does Felix outperform monolithic systems in terms of scalability
and efficiency? From Table 5, we see that Felix scales in cases where monolithic MLN systems do not. On
Enron-R (which contains only 680 emails), we see that when both Felix and Tuffy scale, they achieve similar
result quality. From Figure 5, we see that even when monolithic systems fail to scale (on DBLife), Felix is
able to produce high-quality results.

10 Results from MLN-based systems are ranked by marginal probabilities, results from Cimple are ranked by frequency of occurrences,
and results from rules on Enron-R are ranked by window sizes between a person mention and a phone number mention.

11 The low recall on DBLife is because the ground truth (ACM author profiles) contains many facts absent from DBLife.

12 Measured on KBP-R, which has ground truth.

Figure 6: The RMSE between predictions from different tasks converges stably as Felix runs master-slave
message passing.

To understand the result quality obtained by Felix, we also ran rule-based information-extraction programs
for Enron-R and DBLife following practice described in the literature [12, 25, 27]. Recall that the MLN programs
for Enron-R and DBLife were created by augmenting the deterministic rule sets with statistical reasoning. It
should be noted that all systems can be improved with further tuning. In particular, the rules described in the
literature ("Rule Set 1" for Enron-R [25, 27] and "Rule Set 2" for DBLife [12]) were not specifically optimized for
high quality on the corresponding tasks. On the other hand, the corresponding MLN programs were generated
in a constrained manner (as described in Section D.1). In particular, we did not leverage state-of-the-art NLP
tools nor refine the MLN programs. With these caveats in mind, from Figure 5 we see that (1) on Enron-R,
Felix achieves higher precision than Rule Set 1 given the same recall; and (2) on DBLife, Felix achieves
higher recall than Rule Set 2 (i.e., Cimple [12]) at any precision level. This provides a preliminary indication
that statistical reasoning could help improve the result quality of knowledge-base construction tasks, and that
scaling up MLN inference is a promising approach to high-quality knowledge-base construction. Nevertheless, it
is interesting future work to more deeply investigate how statistical reasoning contributes to quality improvement
over deterministic rules (e.g., Michelakis et al. [27]).

6.2 Effectiveness of Message Passing 

We validate that the Lagrangian scheme in Felix can effectively reconcile conflicting predictions between related
tasks to produce consistent output. Recall that Felix uses master-slave message passing to iteratively reconcile
inconsistencies between different copies of a shared relation. To validate that this scheme is effective, we measure
the difference between the marginal probabilities reported by different copies; we plot this difference as Felix
runs 100 iterations. Specifically, we measure the root-mean-square error (RMSE) between the marginal
predictions of shared tuples between tasks. On each of the four datasets (i.e., KBP-R, Enron-R, DBLife, and
NFL), we plot how the RMSE changes over time. As shown in Figure 6, Felix stably reduces the RMSE
on all datasets to an eventual value below 0.1: after about 80 iterations on Enron-R and after the very first
iteration for the other three datasets. (As many statistical inference algorithms are stochastic, it is expected
that the RMSE does not decrease to zero.) This demonstrates that Felix can effectively reconcile conflicting
predictions, thereby achieving joint inference.
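
For concreteness, the disagreement measure between two tasks can be computed as in the sketch below. The dictionary representation of marginals and the function name `rmse` are assumptions; the sample probabilities are made up.

```python
import math

def rmse(marginals_a, marginals_b):
    """RMSE between two tasks' marginal probabilities over shared tuples.

    Each argument maps a tuple identifier to the marginal probability that
    one task's copy of the shared relation assigns to that tuple.
    """
    shared = marginals_a.keys() & marginals_b.keys()
    if not shared:
        return 0.0
    squared = sum((marginals_a[p] - marginals_b[p]) ** 2 for p in shared)
    return math.sqrt(squared / len(shared))

task_a = {"p1": 0.9, "p2": 0.2, "p3": 0.5}
task_b = {"p1": 0.7, "p2": 0.2}
disagreement = rmse(task_a, task_b)  # only p1 and p2 are shared
```

Tracking this value after every iteration of message passing yields a convergence curve of the kind summarized above.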

MLN inference is NP-hard, and so it is not always the case that Felix converges to the exact optimal 
solution of the original program. However, as we validated in the previous section, empirically Felix converges 
to close approximations of monolithic inference results (only more efficiently). 



13 For Enron-R, we followed the rules described in related publications [25, 27]. For DBLife, we obtained the Cimple 
and the DBLife dataset from the authors. Further details can be found in Section D.1. 
Table 6: Performance and quality comparison on individual tasks. "Initial" (resp. "Final") is the time when a 
system produced the first (resp. converged) result. "F1" is the F1 score of the final output. 



6.3 Importance of Specialized Tasks 

We validate that the ability to integrate specialized tasks into MLN inference is key to Felix's higher perfor- 
mance and scalability. To do this, we first show that specialized algorithms have higher efficiency than generic 
MLN inference on individual tasks. Second, we validate that specialized tasks are key to Felix's scalability on 
MLN inference. 



Quality & Efficiency We first demonstrate that Felix's specialized algorithms outperform generic MLN 
inference algorithms in both quality and performance when solving specialized tasks. To evaluate this claim, we 
run Felix, Tuffy, and Alchemy on three MLN programs that each encode one of the following tasks: simple 
classification, correlated classification, and coreference. We use a subset of the Cora dataset^14 for coref, and a 
subset of the CoNLL 2000 chunking dataset^15 for classification. The results are shown in Table 6. While it always 
takes less than a minute for Felix to finish each task, Tuffy and Alchemy take much longer. Moreover, 
the quality of Felix is higher than Tuffy and Alchemy. As expected, Felix can achieve exact optimal 
solutions for classification, and nearly optimal approximation for coref, whereas Tuffy and Alchemy rely 
on a general-purpose SAT counting algorithm. Nevertheless, the above micro benchmark results are typically 
drowned out in larger-scale applications, where the quality difference tends to be smaller compared to the results 
here. 



Scalability To demonstrate that specialized tasks are crucial to the scalability of Felix, we remove specialized 
tasks from Felix and re-evaluate whether Felix is still able to scale to the four datasets (KBP, Enron, DBLife, 
and NFL). The results are as follows: after disabling classification, Felix crashes on KBP and DBLife; after 
disabling coref, Felix crashes on Enron. On NFL, although Felix is still able to run without specialized tasks, 
its performance slows down by an order of magnitude (from less than five minutes to more than one hour). 
These results suggest that specialized tasks are critical to Felix's high scalability and performance. 



6.4 Importance of DMO Optimization 

We validate that Felix's cost-based approach to data movement optimization is crucial to the efficiency of 
Felix. To do this, we run Felix on subsets of Enron with various sizes in three different settings: 1) Eager, 
where all DMOs are evaluated eagerly; 2) Lazy, where all DMOs are evaluated lazily; 3) Opt, where Felix 
decides the materialization strategy for each DMO based on the cost model in Section 5.2. 

We observed that overall Opt is substantially more efficient than both Lazy and Eager, and found that 
the deciding factor is the efficiency of the DMOs of the coref tasks. Thus, we specifically measure the total 

14 http://alchemy.cs.washington.edu/data/cora 
15 http://www.cnts.ua.ac.be/conll2000/chunking/ 
Table 7: DMO efficiency under different settings. 



run time of individual coref tasks, and compare the results in Table 7. Here, E-xk for $x \in \{5, 20, 50, 100\}$ 
refers to a randomly selected subset of xk emails in the Enron corpus. We observe that the performance of the 
eager materialization strategy degrades rapidly as the dataset size increases. The lazy strategy performs much 
better. The cost-based approach can further achieve 2-3X speedup. This demonstrates that our cost-based 
materialization strategy for data movement operators is crucial to the efficiency of Felix. 

7 Conclusion and Future Work 

We present our Felix approach to MLN inference that uses relation-level Lagrangian relaxation to decompose 
an MLN program into multiple tasks and solve them jointly. Such task decomposition enables Felix to inte- 
grate specialized algorithms for common tasks (such as classification and coreference) with both high efficiency 
and high quality. To ensure that tasks can communicate and access data efficiently, Felix uses a cost-based 
materialization strategy for data movement. To free the user from manual task decomposition, the compiler of 
Felix performs static analysis to find specialized tasks automatically. Using these techniques, we demonstrate 
that Felix is able to scale to complex knowledge-base construction applications and produce high-quality results 
whereas previous MLN systems have much poorer scalability. Our future work is in two directions: First, we 
plan to apply our key techniques (in-database Lagrangian relaxation and cost-based materialization) to other 
inference problems. Second, we plan to extend Felix with new logical tasks and physical implementations to 
support broader applications. 

A Notations 

Table 8 defines some common notation that is used in the following sections. 

Notation                                                          Definition 
$a, b, \ldots, \alpha, \beta, \ldots$                             Singular (random) variables 
$\mathbf{a}, \mathbf{b}, \ldots, \boldsymbol{\alpha}, \boldsymbol{\beta}, \ldots$   Vectorial (random) variables 
$\boldsymbol{\mu} \cdot \boldsymbol{\nu}$                         Dot product between vectors 
$|\boldsymbol{\mu}|$                                              Length of a vector or size of a set 
$\mu_j$                                                           $j$-th element of a vector 
$\hat{a}, \hat{\alpha}$                                           A value of a variable 

Table 8: Notations 



B Theoretical Background of the Operator-based Approach 

In this section, we discuss the theoretical underpinning of Felix's operator-based approach to MLN inference. 
Recall that Felix first decomposes an input MLN program based on a predefined set of operators, instantiates 
those operators with code selection, and then executes the operators using ideas from dual decomposition. We 
first justify our choice of specialized subtasks (i.e., Classification, Sequential Labeling, and Coref) in terms of 
two compilation soundness and language expressivity properties: 

1. Given an MLN program, the subprograms obtained by Felix's compiler indeed encode specialized sub- 
tasks such as classification, sequential labeling, and coref. 

2. MLN as a language is expressive enough to encode all possible models in the exponential family of each 
subtask type; specifically, MLN subsumes logistic regression (for classification), conditional random fields 
(for labeling), and correlation clustering (for coref). 

We then describe how dual decomposition is used to coordinate the operators in Felix for both MAP and 
marginal inference while maintaining the semantics of MLNs. 



B.1 Consistent Semantics 

B.1.1 MLN Program Solved as Subtasks 

In this section, we show that the decomposition of an MLN program produced by Felix's compiler indeed 
corresponds to the subtasks defined in Section 4.2. 



Simple Classification Suppose a classification operator (i.e., task) for a query relation $R(k,v)$ consists of 
key-constraint hard rules together with rules $r_1, \ldots, r_t$ (with weights $w_1, \ldots, w_t$).^16 As per Felix's compilation 
procedure, the following holds: 1) $R(k,v)$ has a key constraint (say $k$ is the key); and 2) none of the selected 
rules are recursive with respect to $R$. 

Let $k_0$ be a fixed value of $k$. Since $k$ is a possible-world key for $R(k,v)$, we can partition the set of all 
possible worlds into sets based on their $v$ for $R(k_0, v)$ (and whether there is any value $v$ that makes $R(k_0, v)$ true). Let 
$\mathcal{W}_{v_i} = \{W \mid W \models R(k_0, v_i)\}$ and let $\mathcal{W}_\perp$ be the set of worlds where $R(k_0, v)$ is false for all $v$. Define 
$Z(\mathcal{W}) = \sum_{w \in \mathcal{W}} \exp\{-\mathrm{cost}(w)\}$. Then according to the semantics of MLN, 

$$\Pr[R(k_0, v_0)] = \frac{Z(\mathcal{W}_{v_0})}{Z(\mathcal{W}_\perp) + \sum_{j} Z(\mathcal{W}_{v_j})}.$$ 

16 For simplicity, we assume that these t rules are ground formulas. It is easy to show that grounding does not change the property 
of rules. 

It is immediate from this that each class is disjoint. It is also clear that, conditioned on the values of the 
rule bodies, the $R(k, \cdot)$ are independent of each other. 



Correlated Classification Suppose a correlated classification operator outputs a relation $R(k, v)$ and consists 
of hard-constraint rules together with ground rules $r_1, \ldots, r_t$ (with weights $w_1, \ldots, w_t$). As per Felix's 
compilation procedure, the following holds: 

• $R(k, v)$ has a key constraint (say $k$ is the key); 

• The rules $r_i$ satisfy the TrRec property. 

Consider the following graph: the nodes are all possible values for the key $k$ and there is an edge $(k, k')$ if $k$ 
appears in the body of a rule for $k'$. Every node in this graph has outdegree at most 1. Now suppose there is a cycle; 
this contradicts the definition of a strict partial order. In turn, this means that this graph is a forest. Then, 
we identify this graph with a graphical model structure where each node is a random variable with domain $D$. 
This is a tree-structured Markov random field. This justifies the rules used by Felix's compiler for identifying 
labeling operators. Again, conditioned on the rule bodies any grounding is a tree-shaped graphical model. 

Coreference Resolution A coreference resolution subtask involving variables $y_1, \ldots, y_n$ infers an equivalence 
relation $R(y_i, y_j)$. The only requirement of this subtask is that the result relation $R(\cdot, \cdot)$ be reflexive, 
symmetric and transitive. Felix ensures these properties by detecting corresponding hard rules directly. 

B.1.2 Subtasks Represented as MLN programs 

We start by showing that all probability distributions in the discrete exponential family can be represented 
by an equivalent MLN program. Therefore, if we model the three subtasks using models in the exponential 
family, we can express them as an MLN program. Fortunately, for each of these subtasks, there are popular 
exponential family models: 1) Logistic Regression (LR) for Classification, 2) Conditional Random Field (CRF) 
for Labeling and 3) Correlation Clustering for Coref.^17 

Definition B.1 (Exponential Family). We follow the definition in [46]. Given a vector of binary random 
variables $x \in \mathcal{X}$, let $\phi : \mathcal{X} \to \{0,1\}^d$ be a binary vector-valued function. For a given $\phi$, let $\theta \in \mathbb{R}^d$ be a vector 
of real number parameters. The exponential family distribution over $x$ associated with $\phi$ and $\theta$ is of the form: 

$$\Pr_\theta[x] = \exp\{-\theta \cdot \phi(x) - A(\theta)\},$$ 
where $A(\theta)$ is known as the log partition function: $A(\theta) = \log \sum_{x \in \mathcal{X}} \exp\{-\theta \cdot \phi(x)\}$. 

This definition extends to multinomial random variables in a straightforward manner. For simplicity, we 
only consider binary random variables in this section. 

Example 1 Consider a textbook logistic regressor over a random variable $x \in \{0, 1\}$: 

$$\Pr[x = 1] = \frac{1}{1 + \exp\{-\sum_i \beta_i f_i\}},$$ 

where the $f_i \in \{0, 1\}$ are known as features of $x$ and the $\beta_i$ are regression coefficients of the $f_i$. This distribution 
is actually in the exponential family: Let $\phi$ be a binary vector-valued function whose $i$-th entry equals 
$\phi_i(x) = (1 - x) f_i$. Let $\theta$ be a vector of real numbers whose $i$-th entry is $\theta_i = \beta_i$. One can check that 

$$\Pr[x = 1] = \frac{\exp\{-\theta \cdot \phi(1)\}}{\exp\{-\theta \cdot \phi(1)\} + \exp\{-\theta \cdot \phi(0)\}} = \frac{1}{1 + \exp\{-\sum_i \beta_i f_i\}}.$$ 
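
The identity in Example 1 can be checked numerically. The sketch below is only an illustration: the coefficient and feature values are made up, and the helper names `logistic` and `exp_family_prob_x1` are assumptions of this example rather than anything defined in the paper.

```python
import math

def logistic(beta, f):
    """Textbook logistic regressor: Pr[x = 1] = 1 / (1 + exp(-sum_i beta_i f_i))."""
    return 1.0 / (1.0 + math.exp(-sum(b * fi for b, fi in zip(beta, f))))

def exp_family_prob_x1(theta, f):
    """Pr[x = 1] under Pr[x] proportional to exp(-theta . phi(x)), phi_i(x) = (1 - x) f_i."""
    def phi(x):
        return [(1 - x) * fi for fi in f]

    def unnormalized(x):
        return math.exp(-sum(t * p for t, p in zip(theta, phi(x))))

    return unnormalized(1) / (unnormalized(1) + unnormalized(0))

beta = [0.7, -1.2, 2.0]   # hypothetical regression coefficients
f = [1, 0, 1]             # hypothetical binary features
p_lr = logistic(beta, f)
p_ef = exp_family_prob_x1(beta, f)   # theta_i = beta_i
```

Both functions return the same probability because $\phi(1)$ is the zero vector and $\phi(0) = f$.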



17 We leave the discussion of models that are not explicitly in exponential family to future work. 

The exponential family has a strong connection with the maximum entropy principle and graphical models. For 
all the three tasks we are considering, i.e., classification, labeling and coreference, there are popular exponential 
family models for each of them. 

Proposition B.1. Given an exponential family distribution over $x \in \mathcal{X}$ associated with $\phi$ and $\theta$, there exists 
an MLN program $\Gamma$ that defines the same probability distribution as $\Pr_\theta[x]$. The length of the formula in $\Gamma$ is 
at most linear in $|x|$, and the number of formulas in $\Gamma$ is at most exponential in $|x|$. 

Proof Our proof is by construction. Each entry of $\phi$ is a binary function $\phi_i(x)$, which partitions $\mathcal{X}$ into two 
subsets: $\mathcal{X}_i^+ = \{x \mid \phi_i(x) = 1\}$ and $\mathcal{X}_i^- = \{x \mid \phi_i(x) = 0\}$. If $\theta_i > 0$, for each $\hat{x} \in \mathcal{X}_i^+$ introduce a rule: 

$$\theta_i \quad \bigvee_{1 \le j \le |x|} R(x_j, 1 - \hat{x}_j).$$ 

If $\theta_i < 0$, for each $\hat{x} \in \mathcal{X}_i^+$ insert a rule: 

$$-\theta_i \quad \bigwedge_{1 \le j \le |x|} R(x_j, \hat{x}_j).$$ 

We add these rules for each $\phi_i(\cdot)$, and also add the following hard rule for each variable $x_i$: 

$$\infty \quad R(x_i, 0) \Leftrightarrow \neg R(x_i, 1).$$ 

It is not difficult to see $\Pr[\forall x_i, R(x_i, \hat{x}_i) = 1] = \Pr_\theta[\hat{x}]$. In this construction, each formula has length $|x|$ and 
there are $\sum_i (|\mathcal{X}_i^+| + 1)$ formulas in total, which is exponential in $|x|$ in the worst case. □ 
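
The construction in the proof can be checked by brute force on a tiny instance. The following Python sketch is an illustration under stated assumptions: it treats each possible world as an assignment $x$ (which is what the hard rule enforces), uses a made-up feature function `phi` and parameter vector `theta`, and takes both the $\theta_i > 0$ and the $\theta_i < 0$ rules over the worlds with $\phi_i = 1$. All function names are hypothetical.

```python
import itertools
import math

def build_rules(phi, theta, n):
    """Soft rules (weight, kind, x_hat) following the construction in the proof."""
    rules = []
    worlds = list(itertools.product((0, 1), repeat=n))
    for i, th in enumerate(theta):
        for x_hat in (x for x in worlds if phi(x)[i] == 1):
            if th > 0:
                rules.append((th, "or", x_hat))    # OR_j  R(x_j, 1 - x_hat_j)
            elif th < 0:
                rules.append((-th, "and", x_hat))  # AND_j R(x_j, x_hat_j)
    return rules

def violated(kind, x_hat, x):
    # With the hard rule R(x_j, 0) <=> not R(x_j, 1), a world is an assignment x.
    if kind == "or":
        return x == x_hat   # every disjunct R(x_j, 1 - x_hat_j) is false
    return x != x_hat       # at least one conjunct R(x_j, x_hat_j) is false

def normalize(scores):
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

def mln_distribution(rules, n):
    """Pr[x] proportional to exp(-sum of weights of violated rules)."""
    return normalize({
        x: math.exp(-sum(w for w, kind, x_hat in rules if violated(kind, x_hat, x)))
        for x in itertools.product((0, 1), repeat=n)
    })

def exp_family_distribution(phi, theta, n):
    """Pr[x] proportional to exp(-theta . phi(x))."""
    return normalize({
        x: math.exp(-sum(t * p for t, p in zip(theta, phi(x))))
        for x in itertools.product((0, 1), repeat=n)
    })

def phi(x):
    # Hypothetical binary features over three binary variables.
    return (x[0] & x[1], x[0] ^ x[2])

theta = (1.5, -0.8)   # hypothetical parameters, one positive and one negative
rules = build_rules(phi, theta, 3)
p_mln = mln_distribution(rules, 3)
p_ef = exp_family_distribution(phi, theta, 3)
```

For a negative $\theta_i$ the conjunction rules add the same constant to every world except for the one world each rule names, so the two distributions agree after normalization.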

Similar constructions apply to the case where x is a vector of multinomial random variables. 

We then show that Logistic Regression, Conditional Random Field and Correlation Clustering all define 
probability distributions in the discrete exponential family, and the number of formulas in their equivalent MLN 
program $\Gamma$ is polynomial in the number of random variables. 



Logistic Regression In Logistic Regression, we model the probability distribution of a Bernoulli variable $y$ 
conditioned on $x_1, \ldots, x_k \in \{0, 1\}$ by 

$$\Pr[y = 1] = \frac{1}{1 + \exp\{-(\beta_0 + \sum_i \beta_i x_i)\}}.$$ 

Define $\phi_i(y) = (1 - y) x_i$ ($\phi_0(y) = 1 - y$) and $\theta_i = \beta_i$; we can see $\Pr[y = 1]$ is in the exponential family 
defined as in Definition B.1. For each $\phi_i(y)$, there is only one $y$ that can get a positive value from $\phi_i$, so there are 
at most $k + 1$ formulas in the equivalent MLN program. 



Conditional Random Field In Conditional Random Field, we model the probability distribution using a 
graph $G = (V, E)$ where $V$ represents the set of random variables $y = \{y_v : v \in V\}$. Conditioned on a set of 
random variables $x$, CRF defines the distribution: 

$$\Pr[y \mid x] \propto \exp\Big\{ \sum_{v \in V, k} \lambda_k f_k(v, y_v, x) + \sum_{(v_1, v_2) \in E, l} \mu_l g_l((v_1, v_2), y_{v_1}, y_{v_2}, x) \Big\}$$ 

This is already in the form of exponential family. Because each function $f_k(v, -, x)$ or $g_l((v_1, v_2), -, -, x)$ 
only relies on 1 or 2 random variables, the resulting MLN program has at most $O(|E| + |V|)$ formulas. In the 
current prototype of Felix, we only consider linear chain CRFs, where $|E| = O(|V|)$. 

Correlation Clustering Correlation clustering is a form of clustering for which there are efficient algorithms 
that have been shown to scale to instances of the coref problem with millions of mentions. Formally, correlation 
clustering treats the coref problem as a graph partitioning problem. The input is a weighted undirected graph 
$G = (V, f)$ where $V$ is the set of mentions with weight function $f : V^2 \to \mathbb{R}$. The goal is to find a partition 
$C = \{C_i\}$ of $V$ that minimizes the disagreement cost: 

$$\mathrm{cost}_{cc}(C) = \sum_{\substack{(v_1, v_2) \in V^2 \\ \exists C_i,\ v_1 \in C_i \wedge v_2 \in C_i \\ f(v_1, v_2) < 0}} |f(v_1, v_2)| \; + \sum_{\substack{(v_1, v_2) \in V^2 \\ \exists C_i,\ v_1 \in C_i \wedge v_2 \notin C_i \\ f(v_1, v_2) > 0}} |f(v_1, v_2)|$$ 

We can define the probability distribution over $C$ similarly as MLN: 

$$\Pr[C] \propto \exp\{-\mathrm{cost}_{cc}(C)\}$$ 

Specifically, let the binary predicate coref($v_1$, $v_2$) indicate whether $v_1 \neq v_2 \in V$ belong to the same cluster. 
First introduce three hard rules enforcing the reflexivity, symmetry, and transitivity properties of coref. Next, 
for each $v_1 \neq v_2 \in V$, introduce a singleton rule coref($v_1$, $v_2$) with weight $f(v_1, v_2)$. It's not hard to show that 
the above distribution holds for this MLN program. 
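
To make the disagreement cost concrete, here is a small Python sketch. It assumes that each unordered pair of mentions is listed once in the weight dictionary `f` (the formula above sums over $V^2$, so a symmetric listing would double every term); the mention names and weights are invented for illustration.

```python
def disagreement_cost(f, clusters):
    """Correlation-clustering disagreement cost of a partition.

    f        : dict mapping a pair of mentions (u, v) to a real weight
    clusters : list of disjoint sets of mentions covering every mention in f
    """
    cluster_of = {v: idx for idx, cluster in enumerate(clusters) for v in cluster}
    cost = 0.0
    for (u, v), weight in f.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_cluster and weight < 0:
            cost += abs(weight)      # negative edge kept inside a cluster
        elif not same_cluster and weight > 0:
            cost += abs(weight)      # positive edge cut by the partition
    return cost

# Hypothetical pairwise weights between three mentions.
f = {("a", "b"): 2.0, ("b", "c"): -1.0, ("a", "c"): 0.5}
cost_split = disagreement_cost(f, [{"a", "b"}, {"c"}])
cost_merged = disagreement_cost(f, [{"a", "b", "c"}])
```

In this toy instance, keeping `c` separate only cuts the weak positive edge between `a` and `c`, while merging everything violates the negative edge between `b` and `c`.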



B.2 Dual Decomposition for MAP and Marginal Inference 

In this section, we formally describe the dual decomposition framework used in Felix to coordinate the oper- 
ators. We start by formalizing MLN inference as an optimization problem. Then we show how to apply dual 
decomposition on these optimization problems. 



B.2.1 Problem Formulation 

Suppose an MLN program $\Gamma$ consists of a set of ground MLN rules $\mathcal{R} = \{r_1, \ldots, r_m\}$ with weights $(w_1, \ldots, w_m)$. 
Let $X = \{x_1, \ldots, x_n\}$ be the set of boolean random variables corresponding to the ground atoms occurring in $\Gamma$. 
Each MLN rule $r_i$ introduces a function $\phi_i$ over the set of random variables $\pi_i \subseteq X$ mentioned in $r_i$: $\phi_i(\pi_i) = 1$ 
if $r_i$ is violated and $0$ otherwise. Let $w$ be a vector of weights. Define vector $\phi(X) = (\phi_1(\pi_1), \ldots, \phi_m(\pi_m))$. 
Given a possible world $x \in 2^X$, the cost can be represented: 

$$\mathrm{cost}(x) = w \cdot \phi(x)$$ 

Suppose Felix decides to solve $\Gamma$ with $t$ operators $O_1, \ldots, O_t$. Each operator $O_i$ contains a set of rules $\mathcal{R}_i \subseteq \mathcal{R}$. 
The set $\{\mathcal{R}_i\}$ forms a partition of $\mathcal{R}$. Let the set of random variables for each operator be $X_i = \bigcup_{r_j \in \mathcal{R}_i} \pi_j$. Let 
$n_i = |X_i|$. Thus, each operator $O_i$ essentially solves the MLN program defined by random variables $X_i$ and 
rules $\mathcal{R}_i$. Given $w$, define $w^i$ to be the weight vector whose entries equal those of $w$ if the corresponding rule appears 
in $\mathcal{R}_i$ and $0$ otherwise. Because $\{\mathcal{R}_i\}$ forms a partition of $\mathcal{R}$, we know $\sum_i w^i = w$. For each operator $O_i$, define 
an $n$-dim vector $\mu_i(X)$, whose $j$-th entry equals $x_j$ if $x_j \in X_i$ and $0$ otherwise. Define the $n$-dim vector $\mu(X)$ whose 
$j$-th entry equals $x_j$. Similarly, let $\phi(X_i)$ be the projection of $\phi(X)$ onto the rules in operator $O_i$. 



Example 2 We use the two sets of rules for classification and labeling in Section 5.1 as a running example. 
For a simple sentence Packers win. in a fixed document $D$ which contains two phrases $P_1$ = "Packers" and 
$P_2$ = "win", we will get the following set of ground formulae:^18 

    ∞    label(D, p, l1), label(D, p, l2) => l1 = l2                       (r_l1) 
    10   next(D, P1, P2), token(P2, 'wins') => label(D, P1, W)             (r_l2) 
    1    label(D, P1, W), next(D, P1, P2) => !label(D, P2, W)              (r_l3) 
    10   label(D, P1, W), referTo(P1, GreenBay) => winner(GreenBay)        (r_c1) 
    10   label(D, P1, L), referTo(P1, GreenBay) => !winner(GreenBay)       (r_c2) 

18 For $r_{l1}$, $p \in \{P_1, P_2\}$, $l_i \in \{W, L\}$. 

After compilation, Felix would assign $r_{l1}$, $r_{l2}$ and $r_{l3}$ to a labeling operator $O_L$, and $r_{c1}$ and $r_{c2}$ to a 
classification operator $O_C$. For each of {winner(GreenBay), label(D, P1, W), label(D, P1, L), label(D, P2, W), 
label(D, P2, L)} we have a binary random variable associated with it. Each rule introduces a function $\phi$; for 
example, the function $\phi_{l2}$ introduced by $r_{l2}$ is: 

$$\phi_{l2}(\text{label}(D, P_1, W)) = \begin{cases} 1 & \text{if label}(D, P_1, W) = \text{False} \\ 0 & \text{if label}(D, P_1, W) = \text{True} \end{cases}$$ 

The labeling operator $O_L$ essentially solves the MLN program with variables $X_L$ = {label(D, P1, W), 
label(D, P1, L), label(D, P2, W), label(D, P2, L)} and rules $\mathcal{R}_L = \{r_{l1}, r_{l2}, r_{l3}\}$. Similarly $O_C$ solves the MLN 
program with variables $X_C$ = {winner(GreenBay), label(D, P1, W), label(D, P1, L)} and rules $\mathcal{R}_C = \{r_{c1}, 
r_{c2}\}$. Note that these two operators share the variables label(D, P1, W) and label(D, P1, L). 
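
The bookkeeping in Example 2 can be sketched in a few lines of Python. This is an illustration only: each rule is reduced to the set of query atoms it mentions (evidence atoms such as next, token and referTo are left out because they are not random variables), the rule schema r_l1 is represented by a single entry covering all of its groundings, and the helper names are hypothetical.

```python
# Query atoms (random variables) mentioned by each ground rule of Example 2.
rules = {
    "r_l1": {"label(D,P1,W)", "label(D,P1,L)", "label(D,P2,W)", "label(D,P2,L)"},
    "r_l2": {"label(D,P1,W)"},
    "r_l3": {"label(D,P1,W)", "label(D,P2,W)"},
    "r_c1": {"label(D,P1,W)", "winner(GreenBay)"},
    "r_c2": {"label(D,P1,L)", "winner(GreenBay)"},
}

# Partition of the rules into operators, as chosen by the compiler in the example.
operators = {"O_L": ["r_l1", "r_l2", "r_l3"], "O_C": ["r_c1", "r_c2"]}

def operator_variables(operators, rules):
    """X_i: the union of the variable sets of the rules assigned to operator O_i."""
    return {op: set().union(*(rules[r] for r in rs)) for op, rs in operators.items()}

def shared_variables(op_vars):
    """Variables that appear in more than one operator and therefore need copies."""
    counts = {}
    for variables in op_vars.values():
        for v in variables:
            counts[v] = counts.get(v, 0) + 1
    return {v for v, c in counts.items() if c > 1}

op_vars = operator_variables(operators, rules)
shared = shared_variables(op_vars)
```

The shared set is exactly where the two operators can disagree, which is what the dual decomposition in the following subsections reconciles.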

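B.2.2 MAP Inference 

MAP inference in MLNs is to find an assignment $x$ to $X$ that minimizes the cost: 

$$\min_{x \in \{0,1\}^n} w \cdot \phi(x). \tag{3}$$ 

Each operator $O_i$ performs MAP inference on $X_i$: 

$$\min_{x_i \in \{0,1\}^{n_i}} w^i \cdot \phi(x_i). \tag{4}$$ 

Our goal is to reduce the problem represented by Eqn. 3 into subproblems represented by Eqn. 4. Eqn. 3 
can be rewritten as 

$$\min_{x \in \{0,1\}^n} \sum_{1 \le i \le t} w^i \cdot \phi(x_i).$$ 

Clearly, the difficulty lies in that, for $i \neq j$, $X_i$ and $X_j$ may overlap. Therefore, we introduce a copy of 
variables for each $O_i$: $x_i^C$. Eqn. 3 now becomes: 

$$\min_{x_i^C \in \{0,1\}^{n_i},\, x} \sum_i w^i \cdot \phi(x_i^C) \quad \text{s.t. } \forall i,\ x_i^C = x. \tag{5}$$ 

The Lagrangian of this problem is: 

$$\mathcal{L}(x, x_1^C, \ldots, x_t^C, \nu_1, \ldots, \nu_t) = \sum_i \Big[ w^i \cdot \phi(x_i^C) + \nu_i \cdot \big(\mu_i(x_i^C) - \mu_i(x)\big) \Big] \tag{6}$$ 

Thus, we can relax Eqn. 3 into 

$$\max_{\nu} \Big\{ \sum_i \min_{x_i^C} \big[ w^i \cdot \phi(x_i^C) + \nu_i \cdot \mu_i(x_i^C) \big] - \max_x \sum_i \nu_i \cdot \mu_i(x) \Big\}$$ 

The term $\max_x \sum_i \nu_i \cdot \mu_i(x) = \infty$ unless for each variable $x_j$, $\sum_{O_i : x_j \in X_i} \nu_{i,j} = 0$. 
Converting this into constraints, we get 

$$\max_{\nu} \Big\{ \sum_i \min_{x_i^C} \big[ w^i \cdot \phi(x_i^C) + \nu_i \cdot \mu_i(x_i^C) \big] \Big\} \quad \text{s.t. } \forall x_j \in X,\ \sum_{O_i : x_j \in X_i} \nu_{i,j} = 0.$$ 
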
We can apply sub-gradient methods on $\nu$. The dual decomposition procedure in Felix works as follows: 

1. Initialize $\nu_1^{(0)}, \ldots, \nu_t^{(0)}$. 

2. At step $k$ (starting from 0): 

(a) For each operator $O_i$, solve the MLN program consisting of: 1) original rules in this operator, which 
are characterized by $w^i$; 2) additional priors on each variable in $X_i$, which are characterized by $\nu_i^{(k)}$. 

(b) Get the MAP inference results $\hat{x}_i^C$. 

3. Update $\nu_i$: 

$$\nu_{i,j}^{(k+1)} = \nu_{i,j}^{(k)} - \lambda \left( \hat{x}_{i,j}^C - \frac{\sum_{l : x_j \in X_l} \hat{x}_{l,j}^C}{|\{l : x_j \in X_l\}|} \right)$$ 
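
The update in step 3 can be sketched as a single function that moves each multiplier against the gap between an operator's value for a variable and the average over all operators that contain that variable. This is a minimal sketch, not Felix's implementation: the name `update_multipliers`, the nested-dictionary layout, and the two-operator input (one operator setting a shared variable `x_w` to 1, the other to 0, with step size 0.5) are assumptions of this example.

```python
def update_multipliers(nu, results, step):
    """One sub-gradient step on the multipliers.

    nu      : {operator: {variable: multiplier}}
    results : {operator: {variable: inferred value}}  (0/1 for MAP inference)
    step    : step size (lambda in the text)
    """
    owners = {}
    for op, assignment in results.items():
        for var in assignment:
            owners.setdefault(var, []).append(op)

    new_nu = {op: dict(values) for op, values in nu.items()}
    for var, ops in owners.items():
        mean = sum(results[op][var] for op in ops) / len(ops)
        for op in ops:
            new_nu[op][var] = nu[op].get(var, 0.0) - step * (results[op][var] - mean)
    return new_nu

# Hypothetical inference results of two operators that share x_w and x_l.
results = {"O_L": {"x_w": 1, "x_l": 0}, "O_C": {"x_w": 0, "x_l": 0}}
nu0 = {"O_L": {"x_w": 0.0, "x_l": 0.0}, "O_C": {"x_w": 0.0, "x_l": 0.0}}
nu1 = update_multipliers(nu0, results, step=0.5)
```

Variables on which the operators already agree keep their multipliers unchanged, and the multipliers of each shared variable continue to sum to zero across operators, as the constraint requires.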

Example 3 Consider the MAP inference on the program in Example 2. As $O_L$ and $O_C$ share two random variables, 
$x_w$ = label(D, P1, W) and $x_l$ = label(D, P1, L), we have a copy of them for each operator: $x^C_{w,O_L}$, $x^C_{l,O_L}$ for 
$O_L$; and $x^C_{w,O_C}$, $x^C_{l,O_C}$ for $O_C$. Therefore, we have four $\nu$: $\nu_{w,O_L}$, $\nu_{l,O_L}$ for $O_L$; and $\nu_{w,O_C}$, $\nu_{l,O_C}$ for $O_C$. Assume 
we initialize each $\nu^{(0)}$ to 0 at the first step. 

We start by performing MAP inference on $O_L$ and $O_C$ respectively. In this case, $O_L$ will get the result: 

$$x^C_{w,O_L} = 1, \quad x^C_{l,O_L} = 0.$$ 

$O_C$ admits multiple possible worlds minimizing the cost; for example, it may output 

$$x^C_{w,O_C} = 0, \quad x^C_{l,O_C} = 0,$$ 

which has cost 0. Assume the step size $\lambda = 0.5$. We can update $\nu$ to: 

$$\nu^{(1)}_{w,O_L} = -0.25, \quad \nu^{(1)}_{w,O_C} = 0.25, \quad \nu^{(1)}_{l,O_L} = \nu^{(1)}_{l,O_C} = 0.$$ 

Therefore, when we use these $\nu^{(1)}$ to conduct MAP inference on $O_L$ and $O_C$, we are equivalently adding 

    -0.25   label(D, P1, W)      (r'_l) 

into $O_L$ and 

    0.25    label(D, P1, W)      (r'_c) 

into $O_C$. Intuitively, one may interpret this procedure as the information that "$O_L$ prefers label(D, P1, W) to 
be true" being passed to $O_C$ via $r'_c$. 

B.2.3 Marginal Inference 

The marginal inference of MLNs aims at computing the marginal distribution (i.e., the expectation since we 
are dealing with boolean random variables): 

$$\hat{\mu} = \mathbb{E}_{w}[\mu(X)]. \tag{7}$$ 
The sub-problem of each operator is of the form: 

$$\hat{\mu}_O = \mathbb{E}_{w_O}[\mu_O(X)]. \tag{8}$$ 
Again, the goal is to use solutions for Eqn. 8 to solve Eqn. 7. 
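
For very small programs, the expectation in Eqn. 7 can be computed exactly by enumerating all possible worlds, which is a convenient way to sanity-check an approximate sampler. The sketch below is such a brute-force computation, assuming the MLN semantics that a world's probability is proportional to exp(-cost); the function name `marginals`, the rule representation, and the two-variable toy program are hypothetical.

```python
import itertools
import math

def marginals(variables, rules):
    """Exact marginals E[x_j] under Pr[x] proportional to exp(-cost(x)).

    rules is a list of (weight, violated) pairs, where violated(world) returns
    True when the ground rule is violated in the given world.
    """
    totals = {v: 0.0 for v in variables}
    z = 0.0
    for values in itertools.product((0, 1), repeat=len(variables)):
        world = dict(zip(variables, values))
        cost = sum(weight for weight, violated in rules if violated(world))
        p = math.exp(-cost)
        z += p
        for v in variables:
            totals[v] += p * world[v]
    return {v: totals[v] / z for v in variables}

# Toy program: a single soft rule "a => b" with weight 1.0.
variables = ["a", "b"]
rules = [(1.0, lambda w: w["a"] == 1 and w["b"] == 0)]
mu = marginals(variables, rules)
```

Enumeration is exponential in the number of variables, so this only serves as a reference for instances with a handful of ground atoms.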

We first introduce some auxiliary variables. Recall that $\mu(X)$ corresponds to the set of random variables, 
and $\phi(X)$ corresponds to all functions represented by the rules. We create a new vector $\xi$ by concatenating 
$\mu$ and $\phi$: $\xi(X) = (\mu^T(X), \phi^T(X))$. We create a new weight vector $\theta = (0, \ldots, 0, w^T)$ which is of the same length 
as $\xi$. It is not difficult to see that the marginal inference problem equivalently becomes: 

$$\hat{\xi} = \mathbb{E}_{\theta}[\xi(X)]. \tag{9}$$ 

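Similarly, we define $\theta_O$ for operator $O$ as $\theta_O = (0, \ldots, 0, w_O^T)$. We also define a set of $\theta$: $\Theta_O$, which contains 
all vectors whose entries corresponding to random variables or cliques that do not appear in operator $O$ are zero. The 
partition function $A(\theta)$ is: 

$$A(\theta) = \sum_{X} \exp\{-\theta \cdot \xi(X)\}$$ 

The conjugate dual to $A$ is: 

$$A^*(\xi) = \sup_{\theta} \{\theta \cdot \xi - A(\theta)\}$$ 

A classic result of variational inference [46] shows that 

$$\hat{\xi} = \arg\sup_{\xi \in \mathcal{M}} \{\theta \cdot \xi - A^*(\xi)\}, \tag{10}$$ 

where $\mathcal{M}$ is the marginal polytope. Recall that $\hat{\xi}$ is our goal (see Eqn. 9). Similar to MAP inference, we want to 
decompose Eqn. 10 into different operators by introducing copies of shared variables. We first try to decompose 
$A^*(\xi)$. In $A^*(\xi)$, we search over all possible values for $\theta$. If we only search on a subset of $\theta$, we can get a lower 
bound: 

$$A^{*O}(\xi) = \sup_{\theta \in \Theta_O} \{\theta \cdot \xi - A(\theta)\} \le A^*(\xi)$$ 

Therefore, 

$$-A^*(\xi) \le -\frac{1}{m} \sum_{O} A^{*O}(\xi),$$ 

where $m$ is the number of operators. We approximate $\hat{\xi}$ using this bound: 

$$\hat{\xi} = \arg\sup_{\xi \in \mathcal{M}} \Big\{\theta \cdot \xi - \frac{1}{m} \sum_{O} A^{*O}(\xi)\Big\},$$ 
which is an upper bound of the original goal. We introduce copies of $\xi$: 

$$\hat{\xi} = \arg\sup_{\xi^O \in \mathcal{M},\, \xi} \Big\{\sum_{O} \theta_O \cdot \xi^O - \frac{1}{m} \sum_{O} A^{*O}(\xi^O)\Big\} \quad \text{s.t. } \xi^O_e = \xi_e,\ \forall e \in X \cup \mathcal{R},\ \forall O$$ 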

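The Lagrangian of this problem is: 

$$\mathcal{L}(\xi, \xi^{O_1}, \ldots, \xi^{O_m}, \nu_1, \ldots, \nu_m) = \sum_i \Big[ \theta_i \cdot \xi^{O_i} - \frac{1}{m} A^{*O_i}(\xi^{O_i}) + \nu_i \cdot (\xi^{O_i} - \xi) \Big]$$ 

where $\nu_i \in \Theta_{O_i}$, which means only the entries corresponding to random variables or cliques that appear in 
operator $O_i$ are allowed to have non-zero values. We get the relaxation: 

$$\min_{\nu} \Big\{ \sum_i \sup_{\xi^{O_i}} \Big[ \theta_i \cdot \xi^{O_i} - \frac{1}{m} A^{*O_i}(\xi^{O_i}) + \nu_i \cdot \xi^{O_i} \Big] - \min_{\xi} \sum_i \nu_i \cdot \xi \Big\}$$ 

Consider the $\min_{\xi} \sum_i \nu_i \cdot \xi$ part. This part is equivalent to a set of constraints: 

$$\sum_{O_i : x \in X_i} \nu_{i,x} = 0,\ \forall x \in X$$ 

$$\nu_{i,x} = 0,\ \forall x \notin X_i$$ 

Therefore, we are solving: 

$$\min_{\nu} \sum_i \sup_{\xi^{O_i}} \big\{ m\,\theta_i \cdot \xi^{O_i} - A^{*O_i}(\xi^{O_i}) + \nu_i \cdot \xi^{O_i} \big\}$$ 

$$\text{s.t. } \sum_{O_i : x \in X_i} \nu_{i,x} = 0,\ \forall x \in X; \qquad \nu_{i,x} = 0,\ \forall x \notin X_i$$ 

Note the factor $m$ in front of $\theta_i$; it implies that we multiply the weights in each subprogram by $m$ as well. 
Then we can apply the sub-gradient method on $\nu_i$: 

1. Initialize $\nu_1^{(0)}, \ldots, \nu_t^{(0)}$. 

2. At step $k$ (starting from 0): 

(a) For each operator $O_i$, solve the MLN program consisting of: 1) original rules in this operator, which are 
characterized by $m\theta_i$; 2) additional priors on each variable in $X_i$, which are characterized by $\nu_i^{(k)}$. 

(b) Get the marginal inference results $\hat{\xi}_i^C$. 

3. Update $\nu_i^{(k+1)}$: 

$$\nu_{i,j}^{(k+1)} = \nu_{i,j}^{(k)} - \lambda \left( \hat{\xi}_{i,j}^C - \frac{\sum_{l : x_j \in X_l} \hat{\xi}_{l,j}^C}{|\{l : x_j \in X_l\}|} \right)$$ 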

Example 4 Consider the marginal inference on the case in Example 2. Similar to the example for MAP 
inference, we have copies of random variables: $\xi^C_{w,O_L}$, $\xi^C_{l,O_L}$ for $O_L$; and $\xi^C_{w,O_C}$, $\xi^C_{l,O_C}$ for $O_C$. We also have four 
$\nu$: $\nu_{w,O_L}$, $\nu_{l,O_L}$ for $O_L$; and $\nu_{w,O_C}$, $\nu_{l,O_C}$ for $O_C$. Assume we initialize each $\nu^{(0)}$ to 0 at the first step. 

We start by conducting marginal inference on $O_L$ and $O_C$ respectively. In this case, $O_L$ will get the result: 

$$\xi^C_{w,O_L} = 0.99, \quad \xi^C_{l,O_L} = 0.01,$$ 

while $O_C$ will get: 

$$\xi^C_{w,O_C} = 0.5, \quad \xi^C_{l,O_C} = 0.5.$$ 

Assume the step size $\lambda = 0.5$. We can update $\nu$ as: 

$$\nu^{(1)}_{w,O_L} = -0.12, \quad \nu^{(1)}_{l,O_L} = 0.12, \quad \nu^{(1)}_{w,O_C} = 0.12, \quad \nu^{(1)}_{l,O_C} = -0.12.$$ 

Therefore, when we use these $\nu^{(1)}$ to conduct marginal inference on $O_L$ and $O_C$, we are equivalently adding 

    -0.12   label(D, P1, W)      (r'_l1) 
    0.12    label(D, P1, L)      (r'_l2) 

into $O_L$ and 

    0.12    label(D, P1, W)      (r'_c1) 
    -0.12   label(D, P1, L)      (r'_c2) 

into $O_C$. Intuitively, one may interpret this procedure as the information that "$O_L$ prefers label(D, P1, W) to 
be true" being passed to $O_C$ via $r'_{c1}$. 

C Additional Details of System Implementation 

In this section, we provide additional details of the Felix system. The first part of this section focuses on the 
compiler. We prove some complexity results of property-annotation used in the compiler and describe how to 
apply static analysis techniques originally used in the Datalog literature for data partitioning. Then we describe 
the physical implementation for each logical operator in the current prototype of Felix. We also describe the 
cost model used for the materialization trade-off. 



C.1 Compiler 

C.1.1 Complexity Results 

In this section, we first show that the problem of annotating properties for arbitrary Datalog 
programs is undecidable. Then we prove the $\Pi_2^P$-completeness of the problem of annotating {REF, SYM} given a Datalog 
program without recursion. 

Recursive Programs If there is a single rule with query relation Q of the form Q(x,y) <= Q1(x), Q2(y), 
then {REF, SYM} holds for Q if and only if Q1 or Q2 is empty or Q1 = Q2. We assume that Q1 and 
Q2 are satisfiable. If there is an instance where Q1(a) is true and Q2 is false for all values, then there is another 
world (with all fresh constants) where Q2 is true (and does not return a). Thus, to check REF and SYM for 
Q, we need to decide equivalence of datalog queries. Equivalence of datalog queries is undecidable [1, ch. 12]. 
Since containment and boundedness for monadic datalog queries is decidable, a small technical wrinkle is that 
while Q1 and Q2 are of arity one (monadic) their bodies may contain other recursive (higher arity) predicates. 

Complexity for Nonrecursive Program The above section assumes that we are given an arbitrary Datalog 
program $\Gamma$. In this section, we show that the problem of annotating REF and SYM given a nonrecursive Datalog 
program is $\Pi_2^P$-complete. We allow inequalities in the program. 

We first prove the hardness. Similar to the above section, we need to decide Q1 = Q2. The difference is that 
Q1 and Q2 do not have recursions. Since our language allows us to express conjunctive queries with inequality 
constraints, this establishes $\Pi_2^P$-hardness [23]. 

We now prove the membership in $\Pi_2^P$. We first translate the problem of property annotation to the 
containment problem of Datalog programs, which has been studied for decades [9, 23]; its complexity is in 
$\Pi_2^P$ for Datalog programs without recursions but with inequalities. We will show that, even though the rule 
for checking the symmetric property is recursive, it can be represented by a set of non-recursive rules; therefore the 
classic results still hold. 

We thus limit ourselves to non-recursive MLN programs. Given an MLN program $\Gamma$ which is the union of 
conjunctive queries and a relation Q to which we will annotate properties, all hard rules related to Q can be 
represented as: 

    Q() :- G_1() 
    Q() :- G_2()          (P_1) 
    ... 
    Q() :- G_n() 

where each G_i() contains a set of subgoals. To annotate whether a property holds for the relation Q(), we test 
whether some rules hold for all database instances I generated by the above program $P_1$. For example, for the 
symmetric property, we label Q() as symmetric if and only if Q(x,y) -> Q(y,x) holds. We call this rule the 
testing rule. Suppose the testing rule is Q() :- T(); we create a new program: 

    Q() :- G_1() 
    Q() :- G_2() 
    ...                   (P_2) 
    Q() :- G_n() 
    Q() :- T() 

Given a database $D$, let $P_1(D)$ be the result of applying program $P_1$ to $D$ (using Datalog semantics). The 
testing rule holds for all $P_1(D)$ if and only if $\forall D$, $P_2(D) \subseteq P_1(D)$. In other words, $P_2$ is contained by $P_1$ 
($P_2 \subseteq P_1$). For the reflexive property, whose testing rule is Q(x,x) :- $\mathcal{D}$(x) (where $\mathcal{D}$() is the domain of x), both 
$P_1$ and $P_2$ are non-recursive and the checking of containment is in $\Pi_2^P$ [23]. 

We then consider the symmetric property, whose testing rule is recursive. This is difficult at first glance 
because the containment of recursive Datalog programs is undecidable. However, for this special case, we can 
show it is much easier. For the sake of simplicity, we consider a simplified version of $P_1$ and $P_2$: 

    Q(x,y) :- G(x,y,z)          (P_1') 

Property    Pattern: Template                        Pattern: Condition 
REF         P1(a,b)                                  a = b 
REF         P1(a,b) ∨ !R1(c) ∨ !R2(d)                a = c, b = d, R1 = R2, P1 ≠ R1 
SYM         P1(a,b) ∨ !P2(c,d)                       a = d, b = c, P1 = P2 
SYM         P1(a,b) ∨ !R1(c) ∨ !R2(d)                a = c, b = d, R1 = R2, P1 ≠ R1 
TRN         !P1(a,b) ∨ !P2(c,d) ∨ P3(e,f)            b = c, a = e, d = f, P1 = P2 = P3 



KEY 




n — P h — r r\ — f Pi — Po 


NoREC 


7?-, n \/ \/ 7? n \/ p. n 

/tig V ... V iinU V ~lU 


p. p 


7?-, n \/ \/ t? p, n 


p, p 


TrRec 


r> /O, \/ T'f ^ ,i\ \/ r> ( n £ \ 
r\\a,o) V 1 (c,aj V F2\e,j) 


— C, a — J, (2 — e, i 1 — r 2l 1 yC, a) — [a — C -\- x\, X 7= U 


P /V/ M \/ T( n rl\ \/ P ( o f\ 

Fi{a,o) V i {c,a) V F 2 {e, j) 


— c, a — j , a — e, j 1 — 12 . V^c, aj t 1 ,c l a 


•Fi{a, o) v l [c, a) v /^(e, j) 


— c, a — J, ft — 6, r 1 — 7 9 , ^c, aj — [a — c -\- x\, x =p \J 


\P 1 (a,b)VT(c,d)VP 2 (eJ) 


b = c,d = f,a = e,P 1 =P 2 , V(c, d) eT,cQd 


Pi (a, 6) VT(c,d)V!P 2 (e,/) 


b = c, d = / ', a = e, Pi = P 2 , T(c, d) = [d = c + x],x^0 


Pi(a,6)VT(c,d)V!P 2 (e,/) 


6 = c, d = f, a = e, Pi = P 2 , V(c, d) eT,c^d 


LPiM) VT(M)V!P 2 (e,/) 


b = c, d = / ', a = e, Pi = P 2 , T(c, d) = [d = c + i],x^0 


!Pi(a,6)VT(c,d)V!P 2 (e,/) 


6 = c, d = /, a = e, Pi = P 2 , V(c, d) eT,cQd 



Table 9: Sufficient Conditions for Properties. All Patterns for REF, SYM, TRN, and KEY are hard rules. 



    Q(x,y) :- G(x,y,z) 
    Q(x,y) :- Q(y,x)            (P_2') 

We construct the following program: 

    Q(x,y) :- G(x,y,z) 
    Q(x,y) :- G(y,x,z)          (P_3') 

It is easy to show $P_2' = P_3'$; therefore, we can equivalently check whether $P_3' \subseteq P_1'$, which is in $\Pi_2^P$ since 
neither of the programs is recursive. 
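
The equivalence between the recursive program (with the rule Q(x,y) :- Q(y,x)) and its non-recursive rewriting can be illustrated on a concrete database. The sketch below evaluates both programs over a small, made-up instance of G; the function names `eval_recursive` and `eval_nonrecursive` are hypothetical, and the naive fixpoint loop stands in for Datalog evaluation.

```python
def eval_recursive(g):
    """Naive fixpoint of  Q(x,y) :- G(x,y,z).  Q(x,y) :- Q(y,x)."""
    q = {(x, y) for (x, y, _z) in g}
    while True:
        new_q = q | {(y, x) for (x, y) in q}
        if new_q == q:
            return q
        q = new_q

def eval_nonrecursive(g):
    """Single pass of  Q(x,y) :- G(x,y,z).  Q(x,y) :- G(y,x,z)."""
    return {(x, y) for (x, y, _z) in g} | {(y, x) for (x, y, _z) in g}

# Hypothetical instance of the relation G(x, y, z).
g = {(1, 2, "a"), (2, 3, "b"), (3, 3, "c")}
q_recursive = eval_recursive(g)
q_nonrecursive = eval_nonrecursive(g)
```

Applying the symmetry rule once already reaches the fixpoint, which is why the recursive rule can be replaced by a second, non-recursive rule over G.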



C.1.2 Patterns Used by the Compiler 

Felix exploits a set of regular expressions for property annotation. This set of regular expressions forms a
best-effort compiler, which is sound but not complete. Table 9 shows these patterns. In Felix, a pattern
consists of two components: a template and a boolean expression. A template is a constraint on the "shape"
of the formula. For example, one template for SYM looks like $P_1(a, b) \lor\, !P_2(c, d)$, which means we only consider
rules whose disjunctive form contains exactly two binary predicates with opposite senses. Rules that pass
template matching are then checked against the boolean expression. If a rule passes the template-matching
step, we obtain a set of assignments for each predicate $P_i$ and each variable $a, b, \ldots$. The boolean expression is
a first-order logic formula over these assignments. For example, the boolean expression for the above template is
$(a = d) \land (b = c) \land (P_1 = P_2)$, which means that $P_1$ and $P_2$ must be assigned the same predicate, and the
assignment of variables $a, b, c, d$ must satisfy $(a = d) \land (b = c)$. If there is an assignment that satisfies the
boolean expression, we say that this Datalog rule matches the pattern, and it is annotated with the corresponding labels.
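
The two-step check described above (template first, boolean expression second) can be sketched for the SYM pattern as follows. This is only an illustration: the clause encoding as (negated, predicate, args) tuples and the function name matches_sym are assumptions made here, not Felix's actual data structures, and the sketch does not check the additional requirement that the rule be a hard rule.

```python
def matches_sym(clause):
    """Check a clause against the SYM pattern.

    clause: list of literals in disjunctive form, each a hypothetical
    (negated, predicate, args) tuple, e.g. (True, "coRef", ("y", "x")).
    """
    # Template: exactly two binary literals with opposite senses.
    if len(clause) != 2:
        return False
    (neg1, p1, args1), (neg2, p2, args2) = clause
    if len(args1) != 2 or len(args2) != 2 or neg1 == neg2:
        return False
    a, b = args1
    c, d = args2
    # Boolean expression: (a = d) and (b = c) and (P1 = P2).
    return a == d and b == c and p1 == p2


symmetric = [(False, "coRef", ("x", "y")), (True, "coRef", ("y", "x"))]
not_symmetric = [(False, "coRef", ("x", "y")), (True, "sameDoc", ("y", "x"))]
```

Here `matches_sym(symmetric)` succeeds because both literals use the same predicate with swapped arguments, while `not_symmetric` passes the template but fails the boolean expression.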



C.1.3 Static Analysis for Data Partitioning 

Statistical inference can often be decomposed into independent subtasks on different portions of the data. Take
the classification example in Section 5.1, for instance. The inference of the query relation winner(team) is
"local" to each team constant (assume label is the evidence relation). In other words, deciding whether one
team is a winner does not rely on the decision for another team, team', in this classification subtask. Therefore,
if there are a total of n teams, we have an opportunity to solve this subtask using n concurrent threads.
Another example is labeling, which is often local to small units of sequences (e.g., sentences).

In Felix, we borrow ideas from the Datalog literature [39], which uses linear programming to perform static
analysis to decompose the data. Felix adopts the algorithm of Seib and Lausen [39].
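
To make the idea of key-local subtasks concrete, the sketch below assigns evidence tuples to worker buckets by an integer-encoded key, so that all tuples about the same team land in the same bucket. The tuple layout, the integer encoding of teams, and the function name partition_by_key are assumptions made for illustration; they are not taken from Felix's implementation.

```python
from collections import defaultdict


def partition_by_key(tuples, key_index, n_workers):
    """Group tuples into buckets by (integer key) mod n_workers."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[key_index] % n_workers].append(t)
    return dict(buckets)


# Hypothetical evidence tuples (team_id, feature) with integer-encoded teams.
label = [
    (0, "scored_more"),
    (1, "lost_lead"),
    (2, "scored_more"),
    (0, "home_win"),
    (5, "lost_lead"),
]
buckets = partition_by_key(label, key_index=0, n_workers=3)
```

Because the decision for one team does not depend on any other team, each bucket can be handed to its own thread without changing the result of the classification subtask.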

Consider an operator with query relation $R(x)$. Different instances of $x$ may depend on each other during
inference. For example, consider the rule

R(x) <= R(y), T(x, y).

Intuitively, all instances of $x$ and $y$ that appear in the same rule cannot be solved independently since
$R(x)$ and $R(y)$ are inter-dependent. Such dependency relationships are transitive, and we want to compute
them so that data partitioning does not violate them. A straightforward approach is to ground all rules and
then perform component detection on the resulting graph, but grounding tends to be computationally
demanding. A cheaper way is static analysis that looks at the rules only. Specifically, one solution is to find a
function $f_R(\cdot)$ such that $f_R(x) = f_R(y)$ for all $x$ and $y$ that rely on each other. Because we rely on static
analysis to find $f_R$, this condition should hold for all possible database instances.

Assuming each constant is encoded as an integer in Felix, we may consider functions $f_R$ of the form [39]:

$$f_R(x_1, \ldots, x_n) = \sum_{i=1}^{n} \lambda_i x_i$$

where the $\lambda_i$ are integer constants.

Following [39], Felix uses linear programming to find $\lambda_i$ such that $f_R(\cdot)$ satisfies the above constraints.
Once we have such a partitioning function over the input, we can process the data in parallel. For example, if
we want to run $N$ concurrent threads for $R$, we could assign all data satisfying

$$f_R(x_1, \ldots, x_n) \bmod N = j$$

to the $j$-th thread.

C.2 Operators Implementation

Recall that Felix selects a physical implementation for each logical operator in order to execute it. In this
section, we show a handful of physical implementations for these operators. Each of these physical
implementations only works for a subset of the operator configurations. For cases not covered by these physical
implementations, we can always use Tuffy or Gauss-Seidel-style implementations [30].

Using Logistic Regression for Classification Operators Consider a Classification operator with a query
relation $R(k, v)$, where $k$ is the key. Recall that each possible value of $k$ corresponds to an independent
classification task. The (ground) rules of this operator are all non-recursive with respect to $R$, and so can be
grouped by value of $k$. Specifically, for each value pair $k$ and $v$, define

$$\mathcal{R}_{k,v} = \{ r_i \mid r_i \text{ is violated when } R(k, v) \text{ is true} \}$$
$$\mathcal{R}_{k,\perp} = \{ r_i \mid r_i \text{ is violated when } R(k, v) \text{ is false for all } v \}$$

and

$$W_{k,x} = \sum_{r_i \in \mathcal{R}_{k,x}} |w_i|,$$

where $w_i$ is the weight of rule $r_i$. $W_{k,x}$ intuitively summarizes the penalty we have to pay for assigning $x$ for the key $k$.
With the above notation, one can check that

$$\Pr[R(k, x) \text{ is true}] = \frac{\exp\{-W_{k,x}\}}{\sum_{y} \exp\{-W_{k,y}\}}$$

where both $x$ and $y$ range over the domain of $v$ plus $\perp$, and $R(k, \perp)$ means $R(k, v)$ is false for all values of $v$.
This is implemented using SQL aggregation in a straightforward manner.
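
The aggregation behind this Classification operator can be sketched in a few lines. The sketch assumes the ground rules have already been reduced to hypothetical (key, value, weight) triples, one per ground rule that is violated when that value is chosen for that key; the names classification_marginals and BOTTOM are illustrative only, and the in-memory grouping stands in for the SQL aggregation mentioned above.

```python
import math
from collections import defaultdict

BOTTOM = None  # stands for the outcome "R(k, v) is false for all v"


def classification_marginals(domain, violated):
    """domain: {key: [values]}; violated: iterable of (key, value, weight).

    Returns {key: {value: probability}} with probability proportional
    to exp(-W[key, value]), where W sums |weight| per (key, value).
    """
    W = defaultdict(float)
    for k, x, w in violated:
        W[(k, x)] += abs(w)  # GROUP BY (k, x), SUM(|w|)
    marginals = {}
    for k, values in domain.items():
        outcomes = list(values) + [BOTTOM]
        scores = {x: math.exp(-W[(k, x)]) for x in outcomes}
        z = sum(scores.values())  # normalization over the domain plus BOTTOM
        marginals[k] = {x: s / z for x, s in scores.items()}
    return marginals


domain = {"k1": ["a", "b"]}
violated = [("k1", "a", 1.0), ("k1", BOTTOM, 3.0)]
probs = classification_marginals(domain, violated)
```

In this toy input, value "b" violates no rule and therefore receives the largest probability, while leaving the key unassigned is penalized most heavily.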



Using Conditional Random Field for Correlated Classification Operators The Labeling operator
generalizes the Classification operator by allowing tree-shaped correlations between the individual classification
tasks. For simplicity, assume that such tree-shaped correlation is actually a chain. Specifically, suppose the
possible values of $k$ are $k_1, \ldots, k_m$. Then in addition to the ground rules described in the previous paragraph,
we also have a set of recursive rules each containing $R(k_i, -)$ and $R(k_{i+1}, -)$ for some $1 \le i \le m - 1$. Define

$$\mathcal{R}_{k_i,k_{i+1}} = \{ r_i \mid r_i \text{ contains } R(k_i, -) \text{ and } R(k_{i+1}, -) \}$$
$$W_{k_i,k_{i+1}}(v_i, v_{i+1}) = \sum_{r_i \in \mathcal{R}_{k_i,k_{i+1}}} \mathrm{cost}_{r_i}(\{R(k_i, v_i), R(k_{i+1}, v_{i+1})\}).$$

Then it is easy to show that

$$\Pr[\{R(k_i, v_i), 1 \le i \le m\}] \propto \exp\Big\{ -\sum_{1 \le i \le m} W_{k_i,v_i} - \sum_{1 \le i \le m-1} W_{k_i,k_{i+1}}(v_i, v_{i+1}) \Big\},$$

which is exactly a linear-chain CRF.

Again, Felix uses SQL to compute the above intermediate statistics, and then resorts to the Viterbi
algorithm [24] (for MAP inference) or the sum-product algorithm [46] (for marginal inference).

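
For MAP inference on such a chain, the most likely labeling is the one that minimizes the total penalty, which a standard Viterbi recursion finds. The following is a minimal sketch, assuming the unary and pairwise penalties have already been computed (for example by SQL aggregation); the dictionary-based input layout and the name viterbi_min_cost are illustrative assumptions, not Felix's code.

```python
def viterbi_min_cost(unary, pairwise):
    """Minimum-penalty labeling of a chain.

    unary[i][v]: penalty of label v at position i.
    pairwise[i][(u, v)]: penalty of labels (u, v) at positions (i, i + 1);
    missing pairs are treated as penalty 0.
    """
    labels = list(unary[0])
    best = dict(unary[0])  # best[v]: cheapest prefix ending in label v
    back = []
    for i in range(1, len(unary)):
        new_best, ptr = {}, {}
        for v in labels:
            # Cheapest predecessor label for v at position i.
            u_star = min(
                labels, key=lambda u: best[u] + pairwise[i - 1].get((u, v), 0.0)
            )
            new_best[v] = (
                best[u_star] + pairwise[i - 1].get((u_star, v), 0.0) + unary[i][v]
            )
            ptr[v] = u_star
        best = new_best
        back.append(ptr)
    last = min(labels, key=best.get)
    path = [last]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]


unary = [{"W": 0.0, "L": 1.0}, {"W": 0.0, "L": 0.5}]
pairwise = [{("W", "W"): 2.0}]  # penalize two adjacent WINNER labels
path, cost = viterbi_min_cost(unary, pairwise)
```

In this two-position example the pairwise penalty makes labeling both positions "W" more expensive than the mixed labeling, so the recursion returns the latter.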



Using Correlation Clustering for Coreference Operators The Coref operator can be implemented using
correlation clustering [5]. We show that the constant-approximation algorithm for correlation clustering carries
over to MLNs under some technical conditions. Recall that correlation clustering essentially performs node
partitioning based on the edge weights in an undirected graph. We use the following example to illustrate the
direct connection between MLN rules and correlation clustering.

Example 1 Consider the following ground rules, which are similar to those in Section 5.1:
10 inSameDoc(P1, P2), sameString(P1, P2) => coRef(P1, P2)
5 inSameDoc(P1, P2), subString(P1, P2) => coRef(P1, P2)
5 inSameDoc(P3, P4), subString(P3, P4) => coRef(P3, P4)

Assume coRef is the query relation in this Coreference operator. We can construct the weighted graph
as follows. The vertex set is V = {P1, P2, P3, P4}. There are two edges with non-zero weight: (P1, P2) with
weight 15 and (P3, P4) with weight 5. All other edges have weight 0. The following proposition shows that the
correlation clustering algorithm solves an optimization problem equivalent to MAP inference in MLNs.

Proposition C.1. Let $\Gamma(x_i)$ be a part of $\Gamma$ corresponding to a coref subtask; let $G_i$ be the correlation clustering
problem transformed from $\Gamma(x_i)$ using the above procedure. Then an optimal solution to $G_i$ is also an optimal
solution to $\Gamma(x_i)$.
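
The graph construction of Example 1, and the clustering cost that correlation clustering minimizes, can be sketched as below. The (weight, u, v) encoding of ground rules and the function names are assumptions for illustration, and the sketch assumes the two mentions in each rule are distinct; it only builds the graph and scores a given partition, and does not implement the algorithm of [5].

```python
from collections import defaultdict


def build_graph(ground_rules):
    """Sum rule weights per unordered mention pair.

    ground_rules: iterable of (weight, u, v), one per ground rule whose
    head is coRef(u, v), with u != v.
    """
    weights = defaultdict(float)
    for w, u, v in ground_rules:
        weights[frozenset((u, v))] += w
    return dict(weights)


def partition_cost(weights, cluster_of):
    """Total weight of edges that disagree with the partition."""
    cost = 0.0
    for edge, w in weights.items():
        u, v = tuple(edge)
        same = cluster_of[u] == cluster_of[v]
        # A positive edge is violated when cut; a negative edge when kept together.
        if (w > 0 and not same) or (w < 0 and same):
            cost += abs(w)
    return cost


rules = [(10, "P1", "P2"), (5, "P1", "P2"), (5, "P3", "P4")]
graph = build_graph(rules)
merged = {"P1": 0, "P2": 0, "P3": 1, "P4": 1}
singletons = {"P1": 0, "P2": 1, "P3": 2, "P4": 3}
```

Clustering {P1, P2} and {P3, P4} together violates no ground rule, whereas keeping all four mentions apart pays the full weight of both edges.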

We implement the algorithm of Arasu et al. [5] for correlation clustering. The theorem below shows that, for a
certain family of MLN programs, the algorithm implemented in Felix actually performs approximate MLN inference.

Theorem C.1. Let $\Gamma(x_i)$ be a coref subtask with rules generating a complete graph where each edge has a weight
of either $\pm\infty$ or $w$ s.t. $m \le |w| \le M$ for some $m, M > 0$. Then the correlation clustering algorithm running
on $\Gamma(x_i)$ is a $\frac{3M}{m}$-approximation algorithm in terms of the log-likelihood of the output world.






Proof. In Arasu et al. [5], it was shown that for the case $m = M$, their algorithm achieves an approximation
ratio of 3. If we run the same algorithm, then in expectation the output violates no more than $3 \cdot \mathrm{OPT}$ edges,
where OPT is the number of violated edges in the optimal partition. Now with weighted edges, the optimal
cost is at least $m \cdot \mathrm{OPT}$, and the expected cost of the algorithm output is at most $3M \cdot \mathrm{OPT}$. Thus, the same
algorithm achieves a $\frac{3M}{m}$-approximation. □
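
As a worked instance of the bound, suppose (purely for illustration; these numbers are not from the experiments) that every finite edge weight in a subtask satisfies $5 \le |w| \le 15$, so $m = 5$ and $M = 15$. Combining the two inequalities from the proof gives

$$\frac{\mathbb{E}[\text{cost of output}]}{\text{optimal cost}} \le \frac{3M \cdot \mathrm{OPT}}{m \cdot \mathrm{OPT}} = \frac{3 \cdot 15}{5} = 9.$$

When all finite weights have the same magnitude ($m = M$), the same calculation gives back the ratio of 3 from [5].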



C.3 Cost Model for Physical Optimization 



The cost model in Section 5.2 requires estimation of the individual terms in ExecCost. There are three
components: (1) the materialization cost of each eager query, (2) the cost of lazily evaluating the query in terms of
the materialized views, and (3) the number of times that the query will be executed ($t$). We consider them in
turn.

Computing (1), the subquery materialization cost $\mathrm{Mat}(Q_i)$, is straightforward using PostgreSQL's
EXPLAIN feature. As is common for many RDBMSs, the unit of PostgreSQL's query evaluation cost is not time,
but an internal unit (roughly proportional to the cost of one I/O). Felix performs all calculations in this
unit.

Computing (2), the cost of a single incremental evaluation, is more involved: we do not have $Q_i$ actually
materialized (and with indexes built), so we cannot directly measure $\mathrm{Inc}_Q(Q')$ using PostgreSQL. For simplicity,
consider a two-way decomposition of $Q$ into $Q_1$ and $Q_2$. We consider two cases: (a) when $Q_2$ is estimated to
be larger than PostgreSQL's assigned buffer, and (b) when $Q_2$ is smaller (i.e., can fit in available memory).

To perform this estimation in case (a), Felix makes a simplifying assumption that the $Q_i$ are joined together
using index-nested loop join (we will build the index when we actually materialize the tables). Exploring
clustering opportunities for $Q_i$ is future work.

Then, we force the RDBMS to estimate the detailed costs of the plan $\mathcal{P}: \sigma_{x'=a}(Q_1) \bowtie \sigma_{x'=a}(Q_2)$, where
$Q_1$ and $Q_2$ are views, and $x' = a$ is an assignment to the bound variables $x' = x^b$ in $x$. From the detailed cost
estimation, we extract the following quantities: (1) $n_i$: the number of tuples from subquery $\sigma_{x'=a}(Q_i)$; (2) $n$:
the number of tuples generated by $\mathcal{P}$. We also estimate the cost $\alpha$ (in PostgreSQL's unit) of each I/O by asking
PostgreSQL to estimate the cost of selections on some existing tables.

Denote by $c' = \mathrm{Inc}_Q(Q')$ the cost (in PostgreSQL's unit) of executing $\sigma_{x'=a}(R_1) \bowtie \sigma_{x'=a}(R_2)$, where $R_i$ is
the materialized table of $Q_i$ with proper indexes built. Without loss of generality, assume $n_1 < n_2$ and that $n_1$
is small enough so that the $\bowtie$ in the above query is executed using nested loop join. On average, for each of the
estimated $n_1$ tuples in $\sigma_{x'=a}(R_1)$, there is one index access to $R_2$, and $\lceil n / n_1 \rceil$ tuples in $\sigma_{x'=a}(R_2)$ that can be
joined; assume each of the $\lceil n / n_1 \rceil$ tuples from $R_2$ requires one disk page I/O. Thus, there are $n_1 \lceil n / n_1 \rceil$ disk
accesses to retrieve the tuples from $R_2$, and

$$c' = \alpha\, n_1 \left[ \left\lceil \frac{n}{n_1} \right\rceil + \log |Q_2| \right] \qquad (11)$$

where we use $\log |Q_2|$ as the cost of one index access to $R_2$ (height of a B-tree). Now that both $c' = \mathrm{Inc}_Q(Q')$ and
$\mathrm{Mat}(Q_i)$ are in the unit of PostgreSQL cost, we can sum them together and compare the result with the estimates
for other materialization plans.

In case (b), when $Q_2$ can fit in memory, we found that the above estimation tends to be too conservative:
many accesses to $Q_2$ are cache hits, whereas the model above still counts these accesses as disk I/O. To
compensate for this difference, we multiply $c'$ (derived above) by a fudge factor $\beta < 1$. Intuitively, we choose
$\beta$ as the ratio of the cost of accessing a page in main memory to that of accessing a page on disk. We determine
$\beta$ empirically.

Component (3) is the factor $t$, which is dependent on the statistical operator. However, we can often
derive an estimation method from the algorithm inside the operator. For example, for the algorithm in [5], the
number of requests to an input data movement operator can be estimated by the total number of mentions
(using COUNT) divided by the expected average node degree.
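
The pieces of this cost model can be put together in a short sketch. Several details below are assumptions rather than facts from the text: the logarithm base (base 2 is used here as a stand-in for B-tree height), the way the terms are combined into a plan cost ($t$ times the incremental cost plus the sum of materialization costs), and all function names and input numbers, which are hypothetical.

```python
import math


def incremental_cost(alpha, n1, n, q2_size, beta=1.0):
    """Equation (11): alpha * n1 * (ceil(n / n1) + log|Q2|), scaled by beta.

    beta = 1.0 models case (a); beta < 1 models case (b), where Q2 fits
    in memory. The log base is an assumption (base 2).
    """
    return beta * alpha * n1 * (math.ceil(n / n1) + math.log2(q2_size))


def estimate_t(num_mentions, avg_degree):
    """Component (3) for the clustering example: mentions / average degree."""
    return num_mentions / avg_degree


def plan_cost(t, inc, mat_costs):
    """Assumed combination: t lazy evaluations plus eager materialization."""
    return t * inc + sum(mat_costs)


inc_disk = incremental_cost(alpha=1.0, n1=10, n=95, q2_size=1024)
inc_mem = incremental_cost(alpha=1.0, n1=10, n=95, q2_size=1024, beta=0.1)
t = estimate_t(num_mentions=1000, avg_degree=4)
total = plan_cost(t, inc_disk, mat_costs=[500.0, 1500.0])
```

Comparing `plan_cost` across candidate decompositions is then a matter of evaluating it once per plan; the values of `alpha` and `beta` would come from the measurements described above.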
D Additional Experiments

D.1 Additional Experiments of High-level Scalability and Quality

We describe the detailed methodology in our experiments on the Enron-R, DBLife, and NFL datasets. 



Enron-R The MLN program for Enron-R was based on the rules obtained from related publications on
rule-based information extraction [25,27]. These rules (i.e., "Rule Set 1" in Figure 5) use dictionaries for person
name extraction, and regular expressions for phone number extraction. To extract person-phone relationships,
a fixed window size is used to identify person-phone co-occurrences. We vary this window size to produce a
precision-recall curve of this rule-based approach.

The MLN program used by Felix, Tuffy, and Alchemy replaces the above rules' relation extraction part
(using the same entity extraction results) with a statistical counterpart: instead of fixed window sizes, this
program uses MLN rule weights to encode the strength of co-occurrence and thereby confidence in
person-phone relationships. In addition, we write soft constraints such as "a phone number cannot be associated with
too many persons." We add a set of coreference rules to perform person coref. We run Alchemy, Tuffy,
and Felix on this program.



DBLife The MLN program for DBLife was based on the rules in Cimple [12], which identifies person and
organization mentions using dictionaries with regular expression variations (e.g., abbreviations, titles). In case
of an ambiguous mention such as "J. Smith", Cimple binds it to an arbitrary name in its dictionary that is
compatible (e.g., "John Smith"). Cimple then uses a proximity-based formula to translate person-organization
co-occurrences into ranked affiliation tuples. These form "Rule Set 2" in Figure 5.

The MLN program is constructed as follows. We first extract entities from the corpus. We perform
part-of-speech tagging [38] on the raw text, and then identify possible person/organization names using simple
heuristics (e.g., common person name dictionaries and keywords such as "University"). To handle noise in the
entity extraction results, our MLN program performs both affiliation extraction and coref resolution using ideas
similar to Figure 2.

NFL On the NFL dataset, we extract winner-loser pairs. There are 1,100 sports news articles in the corpus.
We obtain ground truth of game results from the web. As the baseline solution, we use 610 of the articles
together with ground truth to train a CRF model that tags each token in the text as either WINNER, LOSER,
or OTHER. We then apply this CRF model to the remaining 500 articles to generate probabilistic tagging of
the tokens. Those 500 articles report on a different season of NFL games than the training articles, and we
have ground truth on game results (in the form of winner-loser-date triples). We take the publication dates of
the articles and align them to game dates.

The MLN program on NFL consists of two parts. The first part contains MLN rules encoding the CRF 
model for winner/loser team mention extraction. The second part is adapted from the rules developed by a 
research team in the Machine Reading project. Those rules model simple domain knowledge such as "a winner 
cannot be a loser on the same day" and "a team cannot win twice on the same day." We also add in coreference 
of the team mentions. 





            Coref   Labeling   Classification   MLN Inference
Enron-R     1/1     0/0        0/0              1/1
DBLife      2/2     0/0        1/1              0/0
NFL         1/1     1/1        0/0              1/1
Program1    0/0     1/1        0/0              0/0
Program2    0/0     0/0        37/37            0/0
Program3    0/0     0/1        0/0              1/1



Table 10: Specialized Operators Discovered by Felix's Compiler

Figure 7: Performance of $\Pi_2^P$-complete Algorithms for Non-recursive Programs




Figure 8: Plan diagram of Felix's Cost Optimizer 



D.2 Coverage of the Compiler 

Since discovering subtasks as operators is crucial to Felix's scalability, in this section we test Felix's compiler.
We first evaluate the heuristics we use for discovering statistical operators given an MLN program. We
then evaluate the performance of the $\Pi_2^P$-complete algorithm for discovering REF and SYM in non-recursive
programs.



Using Heuristics for Arbitrary MLN Programs While Felix's compiler can discover all Coref, Labeling, 
and Classification operators in all programs used in our experiments, we are also interested in how many 
operators Felix can discover from other programs. To test this, we download the programs that are available 
on Alchemy's Web site and manually label operators in these programs. We manually label a set of rules 
as an operator if this set of rules follows our definition of statistical operators. 

We then run Felix's compiler on these programs and compare the logical plans produced by Felix with
our manual labels. We list all programs with manually labeled operators in Table 10. The x/y in each cell of
Table 10 means that, among y manually labeled operators, Felix's compiler discovers x of them.



We can see from Table 10 that Felix's compiler works well for the programs used in our experiments. Also,
Felix works well on discovering classification and labeling operators in Alchemy's programs. This implies that
the set of heuristic rules we are using, although not complete, indeed encodes some popular patterns users may
use in real-world applications. Although some of Alchemy's programs encode coreference resolution tasks,
none of them was labeled as a coreference operator. This is because none of these programs explicitly declares
the symmetric constraints as hard rules. Therefore, the set of possible worlds decided by the MLN program
is different from those decided by the typical "partitioning"-based semantics of coreference operators. How to
detect and efficiently implement these "soft-coref" operators is an interesting topic for future work.

http://alchemy.cs.washington.edu/mlns/

Figure 9: Convergence of Dual Decomposition



Performance of $\Pi_2^P$-complete Algorithm for Non-recursive Programs In Section 5.3 and Section
C.1.1 we show that there are $\Pi_2^P$-complete algorithms for annotating REF and SYM properties. Felix
implements them. As the intractability is actually inherent in the number of non-distinguished variables, which is
usually small, we are interested in understanding the performance of these algorithms.

We start from one of the longest rules found on Alchemy's Web site that can be annotated as SYM. This
rule has 3 non-distinguished variables. We then add more non-distinguished variables and plot the time used
for each setting (Figure 7). We can see that Felix uses less than 1 second to annotate the original rule, but
exponentially more time when the number of non-distinguished variables grows to 10. This is not surprising
given the exponential complexity of this algorithm. Another conclusion we can draw from Figure 7
is that, as long as the number of non-distinguished variables is less than 10 (which is usually the case in our
programs), Felix performs reasonably efficiently.



D.3 Stability of Cost Estimator 

In our previous experiments we show that the plan generated by Felix's cost optimizer contributes to the 
scalability of Felix. As the optimizer needs to estimate several parameters before performing any predictions, 
we are interested in the sensitivity of our current optimizer to the estimation errors of these parameters. 

The only two parameters used in Felix's optimizer are 1) the cost (in PostgreSQL's unit) of fetching one
page from disk and 2) the ratio between the speed of fetching one page from memory and that of fetching one
page from disk. We test all combined settings of these two parameters (±100% of the estimated value) and
draw the plan diagram of two queries in Figure 8. We represent different execution plans with different colors.
For each point (x, y) in the plan diagram, the color of that point represents which execution plan the compiler
chooses if the PostgreSQL unit cost equals x and the memory/IO ratio equals y.

For those queries not shown in Figure 8, Felix produces the same plan for each tested parameter
combination. For queries shown in Figure 8, we can see that Felix is robust to parameter mis-estimation. In fact, all
the plans shown in Figure 8 are close to optimal, which implies that in our experiments Felix's cost optimizer
avoids selecting "extremely bad" plans even under serious mis-estimation of parameters.



D.4 Convergence of Dual Decomposition 

Felix implements an iterative approach for dual decomposition. One immediate question is how many iterations
we need before the algorithm converges.



To gain some intuition, we run Felix on the DBLife data set for a relatively long time and record the
number of updated Lagrangian multipliers in each iteration. We use a constant step size $\lambda = 0.9$. As shown in
Figure 9, even after more than 130 iterations, the Lagrangian multipliers are still under heavy updates. However,
on the Enron-R dataset, we observed that the whole process converges after the first several iterations. This
implies that the convergence of our operator-based framework depends on the underlying MLN program and
the size of the input data. It would be interesting to see how different techniques for dual decomposition and
gradient methods can alleviate this convergence issue, which we leave as future work.

20 Similar phenomena occur in the NFL dataset as well.

Fortunately, we empirically find that in all of our experiments, taking the result from the first several
iterations is often a reasonable trade-off between time and quality: all P/R curves in the previous experiments
are generated by taking the last iteration within 3000 seconds, and we already get significant improvements
compared to baseline solutions. In Felix, to allow users to directly trade off quality against performance,
we provide two modes: 1) run only the first iteration and flush the result immediately; and 2) run the number
of iterations specified by the user. Automatically selecting parameters for dual decomposition is an
interesting direction to explore.
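
The quantity tracked in this experiment, the number of Lagrangian multipliers updated per iteration, can be illustrated on a toy problem. The sketch below is not Felix's implementation: it assumes two hypothetical subproblems that each hold a copy of the same binary variables with linear costs, and applies the textbook subgradient update with a constant step size. All names and numbers are illustrative.

```python
def dual_decomposition(costs_a, costs_b, step=0.9, max_iters=50):
    """Toy dual decomposition over binary variables shared by two subproblems.

    Minimizes sum_i (costs_a[i] + costs_b[i]) * x_i by giving each
    subproblem its own copy of x and pricing disagreement with
    multipliers nu. Returns (solution, updates_per_iteration).
    """
    n = len(costs_a)
    nu = [0.0] * n
    history = []
    xa = [0] * n
    for _ in range(max_iters):
        # Each subproblem is solved independently given the current multipliers.
        xa = [1 if costs_a[i] + nu[i] < 0 else 0 for i in range(n)]
        xb = [1 if costs_b[i] - nu[i] < 0 else 0 for i in range(n)]
        updated = 0
        for i in range(n):
            if xa[i] != xb[i]:
                # Subgradient step on the dual: move nu toward agreement.
                nu[i] += step * (xa[i] - xb[i])
                updated += 1
        history.append(updated)
        if updated == 0:  # all copies agree: converged
            break
    return xa, history


solution, updates = dual_decomposition([-2.0, 1.0], [1.0, 0.5])
```

On this input the two copies disagree on the first variable for two iterations and agree from the third, so the recorded update counts drop to zero; with a constant step size, larger or more tightly coupled problems can keep updating multipliers for far longer.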